SUSE-RU-2023:4336-1: moderate: Recommended update for slurm_23_02

Thu Nov 2 08:30:03 UTC 2023

# Recommended update for slurm_23_02

Announcement ID: SUSE-RU-2023:4336-1  
Rating: moderate  
References:

  * bsc#1215437

Affected Products:

  * HPC Module 12
  * SUSE Linux Enterprise High Performance Computing 12 SP2
  * SUSE Linux Enterprise High Performance Computing 12 SP3
  * SUSE Linux Enterprise High Performance Computing 12 SP4
  * SUSE Linux Enterprise High Performance Computing 12 SP5
  * SUSE Linux Enterprise Server 12 SP2
  * SUSE Linux Enterprise Server 12 SP3
  * SUSE Linux Enterprise Server 12 SP4
  * SUSE Linux Enterprise Server 12 SP5
  * SUSE Linux Enterprise Server for SAP Applications 12 SP2
  * SUSE Linux Enterprise Server for SAP Applications 12 SP3
  * SUSE Linux Enterprise Server for SAP Applications 12 SP4
  * SUSE Linux Enterprise Server for SAP Applications 12 SP5

An update that has one fix can now be installed.

## Description:

This update for slurm_23_02 fixes the following issues:

  * Updated to version 23.02.5 with the following changes:

  * Bug Fixes:

    * Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the job's environment when `--ntasks-per-node` was requested. The method by which it is set, however, is different and should be more accurate in more situations.
    * Change pmi2 plugin to honor the `SrunPortRange` option. This matches the new behavior of the pmix plugin in 23.02.0. Note that neither of these plugins makes use of the `MpiParams=ports=` option, and both were previously limited only by the system's ephemeral port range.
    * Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if a node features plugin is configured.
    * Fix and prevent reoccurring reservations from overlapping.
    * `job_container/tmpfs` \- Avoid attempts to share BasePath between nodes.
    * With `CR_Cpu_Memory`, fix node selection for jobs that request gres and `--mem-per-cpu`.
    * Fix a regression from 22.05.7 in which some jobs were allocated too few nodes, thus overcommitting cpus to some tasks.
    * Fix a job being stuck in the completing state if the job ends while the primary controller is down or unresponsive and the backup controller has not yet taken over.
    * Fix `slurmctld` segfault when a node registers with a configured `CpuSpecList` while `slurmctld` configuration has the node without `CpuSpecList`.
    * Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after not registering by `ResumeTimeout`.
    * `slurmstepd` \- Avoid cleanup of `config.json`-less containers' spooldir getting skipped.
    * Fix scontrol segfault when 'completing' command requested repeatedly in interactive mode.
    * Properly handle a race condition between `bind()` and `listen()` calls in the network stack when running with SrunPortRange set.
    * Federation - Fix revoked jobs being returned regardless of the `-a`/`--all` option for privileged users.
    * Federation - Fix canceling pending federated jobs from non-origin clusters which could leave federated jobs orphaned from the origin cluster.
    * Fix sinfo segfault when printing multiple clusters with `--noheader` option.
    * Federation - fix clusters not syncing if clusters are added to a federation before they have registered with the dbd.
    * `node_features/helpers` \- Fix node selection for jobs requesting changeable features with the `|` operator, which could prevent jobs from running on some valid nodes.
    * `node_features/helpers` \- Fix inconsistent handling of `&` and `|`, where an AND'd feature was sometimes AND'd to all sets of features instead of just the current set. E.g. `foo|bar&baz` was interpreted as `{foo,baz}` or `{bar,baz}` instead of how it is documented: `{foo} or {bar,baz}`.
    * Fix job accounting so that when a job is requeued its allocated node count is cleared. After the requeue, sacct will correctly show that the job has 0 `AllocNodes` while it is pending or if it is canceled before restarting.
    * `sacct` \- `AllocCPUS` now correctly shows 0 if a job has not yet received an allocation or if the job was canceled before getting one.
    * Fix intel OneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs, and do not detect `/dev/dri/card[0-9]+`.
    * Fix node selection for jobs that request `--gpus` and a number of tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs.
    * Remove `MYSQL_OPT_RECONNECT` completely.
    * Fix cloud nodes in `POWERING_UP` state disappearing (getting set to `FUTURE`) when an `scontrol reconfigure` happens.
    * `openapi/dbv0.0.39` \- Avoid assert / segfault on missing coordinators list.
    * `slurmrestd` \- Correct memory leak while parsing OpenAPI specification templates with server overrides.
    * Fix overwriting user node reason with system message.
    * Prevent deadlock when `rpc_queue` is enabled.
    * `slurmrestd` \- Correct OpenAPI specification generation bug where fields with overlapping parent paths would not get generated.
    * Fix memory leak as a result of a partition info query.
    * Fix memory leak as a result of a job info query.
    * For step allocations, fix `--gres=none` sometimes not ignoring gres from the job.
    * Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't.
    * Fix allocations with `CR_SOCKET`, gres not assigned to a specific socket, and block core distribution potentially allocating more sockets than required.
    * Revert a change in 23.02.3 where Slurm would kill a script's process group as soon as the script ended instead of waiting as long as any process in that process group held the stdout/stderr file descriptors open. That change broke some scripts that relied on the previous behavior. Setting time limits for scripts (such as `PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting indefinitely for scripts to finish.
    * Fix `slurmdbd -R` not returning an error under certain conditions.
    * `slurmdbd` \- Avoid potential NULL pointer dereference in the mysql plugin.
    * Fix regression in 23.02.3 which broke X11 forwarding for hosts when MUNGE sends a localhost address in the encode host field. This happens when the node hostname is mapped to 127.0.0.1 (or similar) in `/etc/hosts`.
    * `openapi/[db]v0.0.39` \- fix memory leak on parsing error.
    * `data_parser/v0.0.39` \- fix updating qos for associations.
    * `openapi/dbv0.0.39` \- fix updating values for associations with null users.
    * Fix minor memory leak with `--tres-per-task` and licenses.
    * Fix cyclic socket cpu distribution for tasks in a step where `--cpus-per-task` < usable threads per core.
    * `slurmrestd` \- For `GET /slurm/v0.0.39/node[s]`, change format of node's energy field `current_watts` to a dictionary to account for unset value instead of dumping 4294967294.
    * `slurmrestd` \- For `GET /slurm/v0.0.39/qos`, change format of QOS's field "priority" to a dictionary to account for unset value instead of dumping 4294967294.
    * `slurmrestd` \- For `GET /slurm/v0.0.39/job[s]`, the 'return code' field in `v0.0.39_job_exit_code` will be set to -127 instead of being left unset when the job does not have a relevant return code.
  * Other Changes:

    * Remove `--uid` / `--gid` options from `salloc` and `srun` commands. These options did not work correctly since the CVE-2022-29500 fix in combination with some changes made in 23.02.0.
    * Add the `JobId` to `debug()` messages indicating when `cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically adjusted.
    * Change the log message warning for rate limited users from verbose to info.
    * `slurmstepd` \- Cleanup per task generated environment for containers in spooldir.
    * Format batch, extern, interactive, and pending step ids into strings that are human readable.
    * `slurmrestd` \- Reduce memory usage when printing out job CPU frequency.
    * `data_parser/v0.0.39` \- Add `required/memory_per_cpu` and `required/memory_per_node` to `sacct --json` and `sacct --yaml` and `GET /slurmdb/v0.0.39/jobs` from slurmrestd.
    * `gpu/oneapi` \- Store cores correctly so CPU affinity is tracked.
    * Allow `slurmdbd -R` to work if the root assoc id is not 1.
    * Limit periodic node registrations to 50 instead of the full `TreeWidth`. Since unresolvable `cloud/dynamic` nodes must disable fanout by setting `TreeWidth` to a large number, this would cause all nodes to register at once.
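
To illustrate the `node_features/helpers` fix for `&` and `|` listed under Bug Fixes, the following is a minimal Python sketch of the documented interpretation, in which `&` binds more tightly than `|`. It is not Slurm's implementation: the function names `parse_feature_expression` and `node_matches` are hypothetical, and the sketch deliberately ignores the rest of the constraint syntax (parentheses, brackets, counts).

```python
from __future__ import annotations


def parse_feature_expression(expr: str) -> list[set[str]]:
    """Split a feature expression into alternative feature sets.

    Hypothetical helper: `&` binds tighter than `|`, so each
    `|`-separated alternative becomes its own set of AND'd features.
    """
    alternatives = []
    for alternative in expr.split("|"):
        features = {name.strip() for name in alternative.split("&") if name.strip()}
        if features:
            alternatives.append(features)
    return alternatives


def node_matches(node_features: set[str], expr: str) -> bool:
    """Return True if the node satisfies at least one alternative set."""
    return any(required <= node_features for required in parse_feature_expression(expr))


# `foo|bar&baz` yields two alternatives: {foo} or {bar, baz}.
print(node_matches({"foo"}, "foo|bar&baz"))         # True: `foo` alone is enough
print(node_matches({"bar"}, "foo|bar&baz"))         # False: `bar` also needs `baz`
print(node_matches({"bar", "baz"}, "foo|bar&baz"))  # True
```

Under the faulty interpretation described in that entry, `baz` was AND'd to every alternative, so a node offering only `foo` would have been rejected.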

## Patch Instructions:

To install this SUSE update use the SUSE recommended installation methods like
YaST online_update or "zypper patch".  
Alternatively you can run the command listed for your product:

  * HPC Module 12  
    zypper in -t patch SUSE-SLE-Module-HPC-12-2023-4336=1

## Package List:

  * HPC Module 12 (aarch64 x86_64)
    * slurm_23_02-node-23.02.5-3.10.6
    * libslurm39-23.02.5-3.10.6
    * slurm_23_02-lua-23.02.5-3.10.6
    * slurm_23_02-pam_slurm-23.02.5-3.10.6
    * perl-slurm_23_02-23.02.5-3.10.6
    * libslurm39-debuginfo-23.02.5-3.10.6
    * slurm_23_02-sql-23.02.5-3.10.6
    * libnss_slurm2_23_02-23.02.5-3.10.6
    * slurm_23_02-slurmdbd-debuginfo-23.02.5-3.10.6
    * slurm_23_02-sview-23.02.5-3.10.6
    * slurm_23_02-devel-23.02.5-3.10.6
    * slurm_23_02-cray-debuginfo-23.02.5-3.10.6
    * slurm_23_02-node-debuginfo-23.02.5-3.10.6
    * slurm_23_02-plugins-debuginfo-23.02.5-3.10.6
    * slurm_23_02-auth-none-debuginfo-23.02.5-3.10.6
    * slurm_23_02-23.02.5-3.10.6
    * slurm_23_02-munge-debuginfo-23.02.5-3.10.6
    * slurm_23_02-plugin-ext-sensors-rrd-debuginfo-23.02.5-3.10.6
    * slurm_23_02-slurmdbd-23.02.5-3.10.6
    * slurm_23_02-plugin-ext-sensors-rrd-23.02.5-3.10.6
    * slurm_23_02-cray-23.02.5-3.10.6
    * libpmi0_23_02-23.02.5-3.10.6
    * slurm_23_02-torque-23.02.5-3.10.6
    * libnss_slurm2_23_02-debuginfo-23.02.5-3.10.6
    * slurm_23_02-lua-debuginfo-23.02.5-3.10.6
    * slurm_23_02-sql-debuginfo-23.02.5-3.10.6
    * slurm_23_02-pam_slurm-debuginfo-23.02.5-3.10.6
    * libpmi0_23_02-debuginfo-23.02.5-3.10.6
    * slurm_23_02-auth-none-23.02.5-3.10.6
    * slurm_23_02-plugins-23.02.5-3.10.6
    * perl-slurm_23_02-debuginfo-23.02.5-3.10.6
    * slurm_23_02-debuginfo-23.02.5-3.10.6
    * slurm_23_02-debugsource-23.02.5-3.10.6
    * slurm_23_02-sview-debuginfo-23.02.5-3.10.6
    * slurm_23_02-torque-debuginfo-23.02.5-3.10.6
    * slurm_23_02-munge-23.02.5-3.10.6
  * HPC Module 12 (noarch)
    * slurm_23_02-webdoc-23.02.5-3.10.6
    * slurm_23_02-config-man-23.02.5-3.10.6
    * slurm_23_02-config-23.02.5-3.10.6
    * slurm_23_02-doc-23.02.5-3.10.6

## References:

  * https://bugzilla.suse.com/show_bug.cgi?id=1215437
