SUSE-FU-2023:3321-1: moderate: Feature update for slurm_23_02 and pdsh

sle-updates at lists.suse.com
Tue Aug 15 12:30:41 UTC 2023



# Feature update for slurm_23_02 and pdsh

Announcement ID: SUSE-FU-2023:3321-1  
Rating: moderate  
References:

  * #1088693
  * #1206795
  * #1208846
  * #1209216
  * #1209260
  * #1212946
  * PED-2987

  
Affected Products:

  * openSUSE Leap 15.4
  * openSUSE Leap 15.5
  * SUSE Linux Enterprise High Performance Computing 15 SP2
  * SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2

  
  
An update that contains one feature and has six feature fixes can now be
installed.

## Description:

This update for slurm_23_02 and pdsh fixes the following issues:

slurm_23_02 - New version upgrade of Slurm to 23.02 (jsc#PED-2987):

  * For the full list of new features and changes please consult the packaged
    NEWS file and the following references:
  * 23.02.2
  * 23.02.1
  * 23.02.0
  * Important notes:
  * If using the `slurmdbd` (Slurm DataBase Daemon) you must update this first.
  * If using a backup DBD you must start the primary first to do any database
    conversion, the backup will not start until this has happened.
  * The 23.02 `slurmdbd` will work with Slurm daemons of version 21.08 and
    above. You will not need to update all clusters at the same time, but it is
    very important to update `slurmdbd` first and have it running before
    updating any other clusters making use of it.
  * Slurm can be upgraded from version 21.08 or 22.05 to version 23.02 without
    loss of jobs or other state information. Upgrading directly from an earlier
    version of Slurm will result in loss of state information.
  * All SPANK plugins must be recompiled when upgrading from any Slurm version
    prior to 23.02
  * PMIx v1.x is no longer supported
  * Packaging patches and changes:
  * Only call slurm_init() if Slurm > 21.02 (bsc#1212946)
  * Web-configurator: changed presets to SUSE defaults.
  * Use `libpmix.so.2` instead of `libpmix.so` (bsc#1209260); this removes the
    need for `pmix-pluginlib`
  * `slurm-plugins` need to require `pmix-pluginlib` (bsc#1209260)
  * Remove workaround to fix the restart issue in a Slurm package described in
    bsc#1088693. The Slurm version in this package is 16.05. Any attempt to
    directly migrate to the current version is bound to fail
  * Now require `slurm-munge` if `munge` authentication is installed
  * testsuite: on later SUSE versions claim ownership of directory
    `/etc/security/limits.d`
  * Move the ext_sensors/rrd plugin to a separate package: this plugin requires
    `librrd`, which in turn requires huge parts of the client-side X Window
    System stack. There is probably no use in cluttering up a system for a
    plugin that is probably only used by a few
  * Configuration file changes:
  * `job_container.conf` \- Added "`Dirs`" option to list desired private mount
    points
  * `node_features` plugins - invalid users specified for `AllowUserBoot` will
    now result in `fatal()` rather than just an error
  * Allow jobs to queue even if the user is not in `AllowGroups` when
    `EnforcePartLimits=no` is set. This ensures consistency for all the
    Partition access controls, and matches the documented behavior for
    `EnforcePartLimits`
  * Add `InfluxDBTimeout` parameter to `acct_gather.conf`
  * `job_container/tmpfs` \- add support for expanding `%h` and `%n` in
    `BasePath`
  * `slurm.conf` \- Removed `SlurmctldPlugstack` option
  * Add new `SlurmctldParameters=validate_nodeaddr_threads=<number>`
    option to allow concurrent hostname resolution at `slurmctld` startup
  * Add new `AccountingStoreFlags=job_extra` option to store a job's extra field
    in the database
  * Add new "`defer_batch`" option to `SchedulerParameters` to only defer
    scheduling for batch jobs
  * Add new `DebugFlags` option '`JobComp`' to replace '`Elasticsearch`'
  * Add configurable job requeue limit parameter - `MaxBatchRequeue` \- in
    `slurm.conf` to permit changes from the old hard-coded value of 5
  * `helpers.conf` \- Allow specification of node specific features
  * `helpers.conf` \- Allow many features to one helper script
  * `job_container/tmpfs` \- Add "`Shared`" option to support shared namespaces.
    This allows autofs to work with the `job_container/tmpfs` plugin when
    enabled
  * `acct_gather.conf` \- Added `EnergyIPMIPowerSensors=Node=DCMI` and
    `Node=DCMI_ENHANCED`.
  * Add new "`getnameinfo_cache_timeout=<number>`" option to
    CommunicationParameters to adjust or disable caching the results of
    `getnameinfo()`
  * Add new `PrologFlags=ForceRequeueOnFail` option to automatically requeue
    batch jobs on Prolog failures regardless of the job `--requeue` setting
  * Add `HealthCheckNodeState=NONDRAINED_IDLE` option.
  * Add '`explicit`' to Flags in `gres.conf`. This makes it so the gres is not
    automatically added to a job's allocation when `--exclusive` is used. Note
    that this is a per-node flag.
  * Moved the "`preempt_`" options from `SchedulerParameters` to
    `PreemptParameters`, and dropped the prefix from the option names. (The old
    options will still be parsed for backwards compatibility, but are now
    undocumented.)
  * Add `LaunchParameters=ulimit_pam_adopt`, which enables setting `RLIMIT_RSS`
    in adopted processes.
  * Update `SwitchParameters=job_vni` to enable/disable creating job VNIs for
    all jobs, or when a user requests them
  * Update `SwitchParameters=single_node_vni` to enable/disable creating single
    node VNIs for all jobs, or when a user requests them
  * Add ability to preserve `SuspendExc*` parameters on reconfig with
    `ReconfigFlags=KeepPowerSaveSettings`
  * `slurmdbd.conf` \- Add new `AllResourcesAbsolute` to force all new resources
    to be created with the `Absolute` flag
  * `topology/tree` \- Add new `TopologyParam=SwitchAsNodeRank` option to
    reorder nodes based on switch layout. This can be useful if the naming
    convention for the nodes does not naturally map to the network topology
  * Removed the default setting for `GpuFreqDef`. If unset, no attempt to change
    the GPU frequency will be made if `--gpu-freq` is not set for the step
  * Command Changes:
  * `sacctmgr` \- no longer force updates to the AdminComment, Comment, or
    SystemComment to lower-case
  * `sinfo` \- Add -F/--future option to sinfo to display future nodes.
  * `sacct` \- Rename 'Reserved' field to 'Planned' to match sreport and the
    nomenclature of the 'Planned' node
  * `scontrol` \- advanced reservation flag MAINT will no longer replace nodes,
    similar to STATIC_ALLOC
  * `sbatch` \- add parsing for #PBS -d and #PBS -w.
  * `scontrol` show assoc_mgr will show username(uid) instead of uid in QoS
    section.
  * Add `strigger --draining` and `-R/--resume` options.
  * Change `--oversubscribe` and `--exclusive` to be mutually exclusive for job
    submission. Job submission commands will now fatal if both are set.
    Previously, these options would override each other, with the last one in
    the job submission command taking effect.
  * `scontrol` \- Requested TRES and allocated TRES will now always be printed
    when showing jobs, instead of one TRES output that was either the requested
    or allocated.
  * `srun --ntasks-per-core` now applies to job and step allocations. Now, use
    of `--ntasks-per-core=1` implies `--cpu-bind=cores` and
    `--ntasks-per-core>1` implies `--cpu-bind=threads`.
  * `salloc/sbatch/srun` \- Check and abort if `ntasks-per-core` >
    `threads-per-core`.
  * `scontrol` \- Add `ResumeAfter=<secs>` option to "scontrol update
    nodename=".
  * Add a new "nodes=" argument to scontrol setdebug to allow the debug level on
    the slurmd processes to be temporarily altered
  * Add a new "nodes=" argument to "scontrol setdebugflags" as well.
  * Make it so `scrontab` prints client-side the job_submit() err_msg (which can
    be set, e.g., by using the log_user() function for the lua plugin).
  * `scontrol` \- Reservations will not be allowed to have STATIC_ALLOC or MAINT
    flags and REPLACE[_DOWN] flags simultaneously
  * `scontrol` \- Reservations will only accept one reoccurring flag when being
    created or updated.
  * `scontrol` \- A reservation cannot be updated to be reoccurring if it is
    already a floating reservation.
  * `squeue` \- removed unused '%s' and 'SelectJobInfo' formats.
  * `squeue` \- align print format for exit and derived codes with that of other
    components (<exit_status>:<signal_number>).
  * `sacct` \- Add --array option to expand job arrays and display array tasks
    on separate lines.
  * Partial support for `--json` and `--yaml` formatted output has been
    implemented for `sacctmgr`, `sdiag`, `sinfo`, `squeue`, and `scontrol`. The
    resultant data output will be filtered by normal command arguments.
    Formatting arguments will continue to be ignored.
  * `salloc/sbatch/srun` \- extended the `--nodes` syntax to allow for a list of
    valid node counts to be allocated to the job. This also supports a "step
    count" value (e.g., --nodes=20-100:20 is equivalent to
    --nodes=20,40,60,80,100) which can simplify the syntax when the job needs to
    scale by a certain "chunk" size
  * `srun` \- add user-requestable VNIs with the `--network=job_vni` option
  * `srun` \- add user-requestable single-node VNIs with the
    `--network=single_node_vni` option
  * API Changes:
  * `job_container` plugins - `container_p_stepd_create()` function signature
    replaced `uint32_t` uid with `stepd_step_rec_t*` step.
  * `gres` plugins - `gres_g_get_devices()` function signature replaced `pid_t
    pid` with `stepd_step_rec_t*` step.
  * `cgroup` plugins - `task_cgroup_devices_constrain()` function signature
    removed `pid_t pid`.
  * `task` plugins - replace `task_p_pre_set_affinity()`,
    `task_p_set_affinity()`, and `task_p_post_set_affinity()` with
    `task_p_pre_launch_priv()`, as it was in Slurm 20.11.
  * Allow for concurrent processing of `job_submit_g_submit()` and
    `job_submit_g_modify()` calls. If your plugin is not capable of concurrent
    operation you must add additional locking within your plugin.
  * Removed return value from slurm_list_append().
  * The List and ListIterator types have been removed in favor of list_t and
    list_itr_t respectively.
  * burst buffer plugins:
    * add `bb_g_build_het_job_script()`
    * `bb_g_get_status()` \- added authenticated UID and GID
    * `bb_g_run_script()` \- added job_info argument
  * `burst_buffer.lua` \- Pass UID and GID to most hooks. Pass `job_info`
    (detailed job information) to many hooks. See `etc/burst_buffer.lua.example`
    for a complete list of changes. _WARNING_ : Backwards compatibility is
    broken for `slurm_bb_get_status`: UID and GID are passed before the variadic
    arguments. If UID and GID are not explicitly listed as arguments to
    `slurm_bb_get_status()`, then they will be included in the variadic
    arguments. Backwards compatibility is maintained for all other hooks because
    the new arguments are passed after the existing arguments.
  * `node_features` plugin changes:
    * `node_features_p_reboot_weight()` function removed.
    * `node_features_p_job_valid()` \- added parameter feature_list.
    * `node_features_p_job_xlate()` \- added parameters feature_list and `job_node_bitmap`
  * New `data_parser` interface with v0.0.39 plugin
  * Test Suite fixes:
  * Update README_Testsuite.md
  * Clean up left over files when de-installing test suite
  * Adjustment to test suite package: for SLE mark the openmpi4 devel package
    and slurm-hdf5 optional
  * Add `-ffat-lto-objects` to the build flags when LTO is set to make sure the
    object files we ship with the test suite still work correctly.
  * Improve `setup-testsuite.sh`: copy ssh fingerprints from all nodes
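
The version rules from the "Important notes" above can be turned into a small pre-upgrade check. The following is only a sketch: the function and constant names are hypothetical, nothing here calls a Slurm API, and the upper bound on daemon versions is an assumption (the notes only state "21.08 and above").

```python
from typing import Tuple

# Releases from which a direct upgrade to 23.02 keeps job and state data,
# as stated in the notes above.
DIRECT_UPGRADE_SOURCES = {(21, 8), (22, 5)}

# Oldest daemon release a 23.02 slurmdbd works with, as stated in the notes.
OLDEST_DAEMON_FOR_DBD = (21, 8)

# Assumption: daemons newer than the slurmdbd itself are not covered.
NEWEST_DAEMON_FOR_DBD = (23, 2)


def parse_release(version: str) -> Tuple[int, int]:
    """Turn '22.05.9' or '22.05' into the (major, minor) release pair."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def keeps_state_on_upgrade(installed: str) -> bool:
    """True if upgrading from the older release `installed` keeps state."""
    return parse_release(installed) in DIRECT_UPGRADE_SOURCES


def works_with_2302_slurmdbd(daemon: str) -> bool:
    """True if a daemon of this version can use a 23.02 slurmdbd."""
    release = parse_release(daemon)
    return OLDEST_DAEMON_FOR_DBD <= release <= NEWEST_DAEMON_FOR_DBD
```

For example, `keeps_state_on_upgrade("20.11.9")` is false, which signals that an intermediate upgrade step would be needed to avoid losing state.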

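Two of the command changes listed above, the extended `--nodes` syntax and the `--ntasks-per-core` binding rule, can be illustrated with a short sketch. The function names are hypothetical and this is not Slurm's own parser. It is also an assumption that a stepped range whose step does not land exactly on the upper bound simply stops below it.

```python
from typing import List


def expand_node_counts(spec: str) -> List[int]:
    """Expand a --nodes list such as '20-100:20' or '2,4,8' into counts."""
    counts: List[int] = []
    for item in spec.split(","):
        if ":" in item:
            # Stepped range: "low-high:step"
            bounds, step_text = item.split(":")
            low_text, high_text = bounds.split("-")
            low, high, step = int(low_text), int(high_text), int(step_text)
            if step <= 0 or low > high:
                raise ValueError(f"invalid stepped range: {item!r}")
            counts.extend(range(low, high + 1, step))
        elif "-" in item:
            # A plain min-max range is the older syntax, not a list of
            # discrete counts, so this sketch refuses to guess.
            raise ValueError(f"plain range not handled here: {item!r}")
        else:
            counts.append(int(item))
    return counts


def implied_cpu_bind(ntasks_per_core: int, threads_per_core: int) -> str:
    """Return the --cpu-bind mode implied by --ntasks-per-core."""
    if ntasks_per_core < 1:
        raise ValueError("ntasks-per-core must be at least 1")
    if ntasks_per_core > threads_per_core:
        # Mirrors the "check and abort" rule for salloc/sbatch/srun.
        raise ValueError("ntasks-per-core exceeds threads-per-core")
    return "cores" if ntasks_per_core == 1 else "threads"
```

With this sketch, `expand_node_counts("20-100:20")` yields the same counts as the explicit list `20,40,60,80,100` given in the announcement.
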
pdsh:

  * Prepared `pdsh` for Slurm 23.02 (jsc#PED-2987)
  * Fix slurm plugin: make sure slurm_init() is called before using the Slurm
    API (bsc#1209216)
  * Fix regression in Slurm 23.02 breaking the pdsh-internal List type by
    exposing it through its public API (bsc#1208846)
  * Backport a number of features and fixes (bsc#1206795):
  * Add '-C' option to the Slurm plugin to restrict selected nodes to ones with
    the specified features present
  * Add option '-k' to the ssh plugin to fail faster on connection failures
  * Fix use of `strchr`
  * `dshbak`: Fix uninitialized use of $tag on empty input
  * `dsh`: Release a lock that is no longer used in dsh()

## Patch Instructions:

To install this SUSE Moderate update use the SUSE recommended installation
methods like YaST online_update or "zypper patch".  
Alternatively you can run the command listed for your product:

  * openSUSE Leap 15.4  
    zypper in -t patch openSUSE-SLE-15.4-2023-3321=1

  * openSUSE Leap 15.5  
    zypper in -t patch openSUSE-SLE-15.5-2023-3321=1

  * SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2  
    zypper in -t patch SUSE-SLE-Product-HPC-15-SP2-LTSS-2023-3321=1

## Package List:

  * openSUSE Leap 15.4 (aarch64 ppc64le s390x x86_64)
    * pdsh-slurm_20_11-debuginfo-2.34-150200.4.11.1
    * pdsh-slurm_20_11-2.34-150200.4.11.1
    * pdsh_slurm_20_11-debugsource-2.34-150200.4.11.1
  * openSUSE Leap 15.5 (aarch64 ppc64le s390x x86_64)
    * pdsh-slurm_20_11-debuginfo-2.34-150200.4.11.1
    * pdsh-slurm_20_11-2.34-150200.4.11.1
    * pdsh_slurm_20_11-debugsource-2.34-150200.4.11.1
  * SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2 (aarch64
    x86_64)
    * pdsh-slurm_22_05-debuginfo-2.34-150200.4.11.1
    * slurm_23_02-sview-23.02.2-150200.5.3.1
    * pdsh-dshgroup-2.34-150200.4.11.1
    * pdsh-slurm_22_05-2.34-150200.4.11.1
    * slurm_23_02-plugins-23.02.2-150200.5.3.1
    * perl-slurm_23_02-23.02.2-150200.5.3.1
    * slurm_23_02-plugin-ext-sensors-rrd-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-munge-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-pam_slurm-debuginfo-23.02.2-150200.5.3.1
    * pdsh-slurm-2.34-150200.4.11.1
    * slurm_23_02-sview-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-cray-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-slurmdbd-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-lua-23.02.2-150200.5.3.1
    * pdsh-netgroup-debuginfo-2.34-150200.4.11.1
    * pdsh-dshgroup-debuginfo-2.34-150200.4.11.1
    * pdsh-genders-2.34-150200.4.11.1
    * slurm_23_02-cray-23.02.2-150200.5.3.1
    * slurm_23_02-munge-23.02.2-150200.5.3.1
    * libpmi0_23_02-23.02.2-150200.5.3.1
    * libnss_slurm2_23_02-debuginfo-23.02.2-150200.5.3.1
    * libslurm39-debuginfo-23.02.2-150200.5.3.1
    * pdsh-genders-debuginfo-2.34-150200.4.11.1
    * slurm_23_02-23.02.2-150200.5.3.1
    * libnss_slurm2_23_02-23.02.2-150200.5.3.1
    * pdsh-machines-debuginfo-2.34-150200.4.11.1
    * libpmi0_23_02-debuginfo-23.02.2-150200.5.3.1
    * pdsh-debugsource-2.34-150200.4.11.1
    * slurm_23_02-node-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-auth-none-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-pam_slurm-23.02.2-150200.5.3.1
    * slurm_23_02-plugin-ext-sensors-rrd-23.02.2-150200.5.3.1
    * libslurm39-23.02.2-150200.5.3.1
    * slurm_23_02-slurmdbd-23.02.2-150200.5.3.1
    * slurm_23_02-auth-none-23.02.2-150200.5.3.1
    * slurm_23_02-sql-23.02.2-150200.5.3.1
    * slurm_23_02-node-23.02.2-150200.5.3.1
    * slurm_23_02-rest-debuginfo-23.02.2-150200.5.3.1
    * pdsh-netgroup-2.34-150200.4.11.1
    * pdsh-2.34-150200.4.11.1
    * slurm_23_02-torque-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-debuginfo-23.02.2-150200.5.3.1
    * perl-slurm_23_02-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-debugsource-23.02.2-150200.5.3.1
    * pdsh-machines-2.34-150200.4.11.1
    * pdsh_slurm_22_05-debugsource-2.34-150200.4.11.1
    * slurm_23_02-devel-23.02.2-150200.5.3.1
    * slurm_23_02-lua-debuginfo-23.02.2-150200.5.3.1
    * pdsh-debuginfo-2.34-150200.4.11.1
    * slurm_23_02-plugins-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-sql-debuginfo-23.02.2-150200.5.3.1
    * slurm_23_02-torque-23.02.2-150200.5.3.1
    * pdsh-slurm-debuginfo-2.34-150200.4.11.1
    * pdsh-slurm_23_02-debuginfo-2.34-150200.4.11.1
    * pdsh-slurm_23_02-2.34-150200.4.11.1
    * slurm_23_02-rest-23.02.2-150200.5.3.1
  * SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2 (noarch)
    * slurm_23_02-doc-23.02.2-150200.5.3.1
    * slurm_23_02-webdoc-23.02.2-150200.5.3.1
    * slurm_23_02-config-man-23.02.2-150200.5.3.1
    * slurm_23_02-config-23.02.2-150200.5.3.1

## References:

  * https://bugzilla.suse.com/show_bug.cgi?id=1088693
  * https://bugzilla.suse.com/show_bug.cgi?id=1206795
  * https://bugzilla.suse.com/show_bug.cgi?id=1208846
  * https://bugzilla.suse.com/show_bug.cgi?id=1209216
  * https://bugzilla.suse.com/show_bug.cgi?id=1209260
  * https://bugzilla.suse.com/show_bug.cgi?id=1212946
  * https://jira.suse.com/browse/PED-2987
