SUSE-RU-2023:4334-1: moderate: Recommended update for slurm_23_02
sle-updates at lists.suse.com
Thu Nov 2 08:30:09 UTC 2023
# Recommended update for slurm_23_02
Announcement ID: SUSE-RU-2023:4334-1
Rating: moderate
References:
* bsc#1215437
Affected Products:
* SUSE Linux Enterprise High Performance Computing 15 SP2
* SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2
An update that has one fix can now be installed.
## Description:
This update for slurm_23_02 fixes the following issues:
* Updated to version 23.02.5 with the following changes:
* Bug Fixes:
* Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the job's environment when `--ntasks-per-node` was requested. The method by which it is now set, however, is different and should be more accurate in more situations (see the batch-script sketch after this change list).
* Change the pmi2 plugin to honor the `SrunPortRange` option. This matches the new behavior of the pmix plugin in 23.02.0. Note that neither of these plugins makes use of the `MpiParams=ports=` option; previously they were limited only by the system's ephemeral port range (see the configuration sketch after this change list).
* Fix regression in 23.02.2 that caused `slurmctld -R` to crash on startup if a node features plugin is configured.
* Fix and prevent recurring reservations from overlapping.
* `job_container/tmpfs` - Avoid attempts to share `BasePath` between nodes.
* With `CR_Cpu_Memory`, fix node selection for jobs that request gres and `--mem-per-cpu`.
* Fix a regression from 22.05.7 in which some jobs were allocated too few nodes, thus overcommitting cpus to some tasks.
* Fix a job being stuck in the completing state if the job ends while the primary controller is down or unresponsive and the backup controller has not yet taken over.
* Fix `slurmctld` segfault when a node registers with a configured `CpuSpecList` while `slurmctld` configuration has the node without `CpuSpecList`.
* Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after not registering by `ResumeTimeout`.
* `slurmstepd` - Avoid skipping cleanup of the spooldir for containers without a `config.json`.
* Fix `scontrol` segfault when the 'completing' command is requested repeatedly in interactive mode.
* Properly handle a race condition between `bind()` and `listen()` calls in the network stack when running with `SrunPortRange` set.
* Federation - Fix revoked jobs being returned regardless of the `-a`/`--all` option for privileged users.
* Federation - Fix canceling pending federated jobs from non-origin clusters which could leave federated jobs orphaned from the origin cluster.
* Fix sinfo segfault when printing multiple clusters with `--noheader` option.
* Federation - Fix clusters not syncing if clusters are added to a federation before they have registered with the dbd.
* `node_features/helpers` - Fix node selection for jobs requesting changeable features with the `|` operator, which could prevent jobs from running on some valid nodes.
* `node_features/helpers` - Fix inconsistent handling of `&` and `|`, where an AND'd feature was sometimes AND'd to all sets of features instead of just the current set. E.g. `foo|bar&baz` was interpreted as `{foo,baz}` or `{bar,baz}` instead of how it is documented: `{foo} or {bar,baz}` (see the constraint sketch after this change list).
* Fix job accounting so that when a job is requeued its allocated node count is cleared. After the requeue, sacct will correctly show that the job has 0 `AllocNodes` while it is pending or if it is canceled before restarting.
* `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet received an allocation or if the job was canceled before getting one.
* Fix Intel oneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs, and do not detect `/dev/dri/card[0-9]+`.
* Fix node selection for jobs that request `--gpus` and a number of tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs.
* Remove `MYSQL_OPT_RECONNECT` completely.
* Fix cloud nodes in `POWERING_UP` state disappearing (getting set to `FUTURE`) when an `scontrol reconfigure` happens.
* `openapi/dbv0.0.39` - Avoid assert / segfault on missing coordinators list.
* `slurmrestd` - Correct memory leak while parsing OpenAPI specification templates with server overrides.
* Fix overwriting user node reason with system message.
* Prevent deadlock when `rpc_queue` is enabled.
* `slurmrestd` \- Correct OpenAPI specification generation bug where fields with overlapping parent paths would not get generated.
* Fix memory leak as a result of a partition info query.
* Fix memory leak as a result of a job info query.
* For step allocations, fix `--gres=none` sometimes not ignoring gres from the job.
* Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't.
* Fix allocations with `CR_SOCKET`, gres not assigned to a specific socket, and block core distribution potentially allocating more sockets than required.
* Revert a change in 23.02.3 where Slurm would kill a script's process group as soon as the script ended instead of waiting as long as any process in that process group held the stdout/stderr file descriptors open. That change broke some scripts that relied on the previous behavior. Setting time limits for scripts (such as `PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting indefinitely for scripts to finish.
* Fix `slurmdbd -R` not returning an error under certain conditions.
* `slurmdbd` - Avoid potential NULL pointer dereference in the mysql plugin.
* Fix regression in 23.02.3 which broke X11 forwarding for hosts when MUNGE sends a localhost address in the encode host field. This occurs when the node hostname is mapped to 127.0.0.1 (or similar) in `/etc/hosts`.
* `openapi/[db]v0.0.39` - Fix memory leak on parsing error.
* `data_parser/v0.0.39` - Fix updating QOS for associations.
* `openapi/dbv0.0.39` - Fix updating values for associations with null users.
* Fix minor memory leak with `--tres-per-task` and licenses.
* Fix cyclic socket cpu distribution for tasks in a step where `--cpus-per-task` < usable threads per core.
* `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of the node's energy field `current_watts` to a dictionary to account for unset values instead of dumping 4294967294.
* `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of the QOS field `priority` to a dictionary to account for unset values instead of dumping 4294967294.
* `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the return code field in `v0.0.39_job_exit_code` will be set to -127 instead of being left unset when the job does not have a relevant return code.
* Other Changes:
* Remove the `--uid` / `--gid` options from the `salloc` and `srun` commands. These options did not work correctly since the CVE-2022-29500 fix in combination with some changes made in 23.02.0.
* Add the `JobId` to `debug()` messages indicating when `cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically adjusted.
* Change the log message warning for rate limited users from verbose to info.
* `slurmstepd` - Clean up the per-task generated environment for containers in the spooldir.
* Format batch, extern, interactive, and pending step ids into strings that are human readable.
* `slurmrestd` \- Reduce memory usage when printing out job CPU frequency.
* `data_parser/v0.0.39` - Add `required/memory_per_cpu` and `required/memory_per_node` to `sacct --json` and `sacct --yaml` and `GET /slurmdb/v0.0.39/jobs` from slurmrestd.
* `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
* Allow `slurmdbd -R` to work if the root assoc id is not 1.
* Limit periodic node registrations to 50 instead of the full `TreeWidth`. Since unresolvable `cloud/dynamic` nodes must disable fanout by setting `TreeWidth` to a large number, this would cause all nodes to register at once.
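The `SLURM_NTASKS` change above can be illustrated with a minimal batch script. This is only a sketch; the node and task counts are arbitrary, and the job contents are hypothetical.

    #!/bin/bash
    # Minimal sketch: tasks are requested per node, without an explicit --ntasks.
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4
    # With 23.02.5, SLURM_NTASKS is again set in the job environment (derived
    # from the per-node request), so tools that read it keep working.
    echo "SLURM_NTASKS=${SLURM_NTASKS}"   # expected: 8 for this request
    srun hostname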
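For the pmi2/`SrunPortRange` change, the following `slurm.conf` excerpt is a minimal sketch; the port ranges are illustrative assumptions, not recommendations, and must fit your site's firewall rules.

    # slurm.conf (excerpt), illustrative values only
    # Ports that srun, and with this update the pmi2 plugin, may listen on:
    SrunPortRange=60001-63000
    # MpiParams=ports= is NOT consulted by the pmi2/pmix plugins (see above);
    # it is shown here only to distinguish the two options:
    MpiParams=ports=12000-12999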
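The feature-expression fix can be exercised with a constraint like the one below. The feature names `foo`, `bar`, and `baz` are hypothetical and assume a `node_features/helpers` configuration that exposes them as changeable features.

    # Minimal sketch: combined OR/AND feature expression at submission time.
    # With this fix the expression is honored as documented: eligible nodes
    # provide "foo", OR provide both "bar" and "baz".
    sbatch --constraint="foo|bar&baz" job.sh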
## Patch Instructions:
To install this SUSE update, use the SUSE recommended installation methods such as
YaST online_update or "zypper patch".
Alternatively, you can run the command listed for your product:
* SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2
zypper in -t patch SUSE-SLE-Product-HPC-15-SP2-LTSS-2023-4334=1
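A minimal sketch of applying and verifying the patch on an affected system follows; the repository refresh and the daemon restart are assumptions based on common practice, not part of this announcement.

    # Refresh repositories, then install the patch named in this announcement:
    zypper refresh
    zypper in -t patch SUSE-SLE-Product-HPC-15-SP2-LTSS-2023-4334=1
    # Verify that the updated package is installed:
    rpm -q slurm_23_02        # expected: slurm_23_02-23.02.5-150200.5.11.2
    # Assumption: restart the relevant Slurm daemons afterwards, e.g. on a
    # compute node: systemctl restart slurmd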
## Package List:
* SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2 (aarch64 x86_64)
* libnss_slurm2_23_02-23.02.5-150200.5.11.2
* libpmi0_23_02-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-slurmdbd-23.02.5-150200.5.11.2
* slurm_23_02-plugin-ext-sensors-rrd-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-slurmdbd-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-pam_slurm-23.02.5-150200.5.11.2
* slurm_23_02-sql-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-sview-23.02.5-150200.5.11.2
* slurm_23_02-munge-23.02.5-150200.5.11.2
* slurm_23_02-rest-23.02.5-150200.5.11.2
* slurm_23_02-lua-23.02.5-150200.5.11.2
* slurm_23_02-munge-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-plugins-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-23.02.5-150200.5.11.2
* slurm_23_02-lua-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-torque-23.02.5-150200.5.11.2
* slurm_23_02-cray-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-node-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-sview-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-cray-23.02.5-150200.5.11.2
* libnss_slurm2_23_02-debuginfo-23.02.5-150200.5.11.2
* libslurm39-23.02.5-150200.5.11.2
* slurm_23_02-sql-23.02.5-150200.5.11.2
* perl-slurm_23_02-23.02.5-150200.5.11.2
* slurm_23_02-plugins-23.02.5-150200.5.11.2
* slurm_23_02-auth-none-debuginfo-23.02.5-150200.5.11.2
* libslurm39-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-pam_slurm-debuginfo-23.02.5-150200.5.11.2
* perl-slurm_23_02-debuginfo-23.02.5-150200.5.11.2
* libpmi0_23_02-23.02.5-150200.5.11.2
* slurm_23_02-torque-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-devel-23.02.5-150200.5.11.2
* slurm_23_02-node-23.02.5-150200.5.11.2
* slurm_23_02-auth-none-23.02.5-150200.5.11.2
* slurm_23_02-plugin-ext-sensors-rrd-23.02.5-150200.5.11.2
* slurm_23_02-rest-debuginfo-23.02.5-150200.5.11.2
* slurm_23_02-debugsource-23.02.5-150200.5.11.2
* SUSE Linux Enterprise High Performance Computing 15 SP2 LTSS 15-SP2 (noarch)
* slurm_23_02-doc-23.02.5-150200.5.11.2
* slurm_23_02-webdoc-23.02.5-150200.5.11.2
* slurm_23_02-config-man-23.02.5-150200.5.11.2
* slurm_23_02-config-23.02.5-150200.5.11.2
## References:
* https://bugzilla.suse.com/show_bug.cgi?id=1215437