SUSE-SU-2025:01757-1: important: Security update for slurm_24_11

# Security update for slurm_24_11

Announcement ID: SUSE-SU-2025:01757-1  
Release Date: 2025-05-29T14:47:58Z  
Rating: important  
References:

  * bsc#1243666

  
Cross-References:

  * CVE-2025-43904

  
CVSS scores:

  * CVE-2025-43904 ( SUSE ):  8.5
    CVSS:4.0/AV:L/AC:L/AT:N/PR:L/UI:N/VC:H/VI:H/VA:H/SC:N/SI:N/SA:N
  * CVE-2025-43904 ( SUSE ):  7.8 CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H

  
Affected Products:

  * HPC Module 12
  * SUSE Linux Enterprise High Performance Computing 12 SP2
  * SUSE Linux Enterprise High Performance Computing 12 SP3
  * SUSE Linux Enterprise High Performance Computing 12 SP4
  * SUSE Linux Enterprise High Performance Computing 12 SP5
  * SUSE Linux Enterprise Server 12 SP2
  * SUSE Linux Enterprise Server 12 SP3
  * SUSE Linux Enterprise Server 12 SP4
  * SUSE Linux Enterprise Server 12 SP5
  * SUSE Linux Enterprise Server for SAP Applications 12 SP2
  * SUSE Linux Enterprise Server for SAP Applications 12 SP3
  * SUSE Linux Enterprise Server for SAP Applications 12 SP4
  * SUSE Linux Enterprise Server for SAP Applications 12 SP5

  
  
An update that solves one vulnerability can now be installed.

## Description:

This update for slurm_24_11 fixes the following issues:

Update to version 24.11.5.

Security issues fixed:

  * CVE-2025-43904: An issue with permission handling for Coordinators within
    the accounting system allowed Coordinators to promote a user to
    Administrator (bsc#1243666).
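
On clusters with accounting enabled, the following `sacctmgr` queries (field
names per the sacctmgr man page; verify against your installed version) may
help confirm that no user was unexpectedly promoted before this update:

    # Audit AdminLevel for all accounting users; unexpected "Administrator"
    # entries could indicate abuse of this vulnerability
    sacctmgr show user format=User,AdminLevel

    # List accounts together with their configured coordinators
    sacctmgr show account withcoordinators format=Account,Coordinators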

Other changes and issues fixed:

  * Changes from version 24.11.5

  * Return an error to `scontrol reboot` on bad nodelists.

  * `slurmrestd` - Report an error when QOS resolution fails for v0.0.40
    endpoints.
  * `slurmrestd` - Report an error when QOS resolution fails for v0.0.41
    endpoints.
  * `slurmrestd` - Report an error when QOS resolution fails for v0.0.42
    endpoints.
  * `data_parser/v0.0.42` - Added `+inline_enums` flag, which modifies the
    output when generating the OpenAPI specification. It causes enum arrays to
    no longer be defined in their own schema with references (`$ref`) to them;
    instead they are dumped inline (see the sketch after this list).
  * Fix binding error with `tres-bind map/mask` on partial node allocations.
  * Fix `stepmgr` enabled steps being able to request features.
  * Reject step creation if requested feature is not available in job.
  * `slurmd` - Restrict listening for new incoming RPC requests further into
    startup.
  * `slurmd` - Avoid `auth/slurm` related hangs of CLI commands during startup
    and shutdown.
  * `slurmctld` - Restrict processing new incoming RPC requests further into
    startup. Stop processing requests sooner during shutdown.
  * `slurmctld` - Avoid `auth/slurm` related hangs of CLI commands during
    startup and shutdown.
  * `slurmctld` - Avoid race condition during shutdown or reconfigure that
    could result in a crash due to delayed processing of a connection while
    plugins are unloaded.
  * Fix small memory leak when getting the job list from the database.
  * Fix incorrect printing of `%` escape characters when printing stdio fields
    for jobs.
  * Fix padding parsing when printing stdio fields for jobs.
  * Fix printing `%A` array job id when expanding patterns.
  * Fix reservations causing jobs to be held for `Bad Constraints`.
  * `switch/hpe_slingshot` - Prevent potential segfault on failed curl request
    to the fabric manager.
  * Fix printing incorrect array job id when expanding stdio file names. The
    `%A` will now be substituted by the correct value.
  * `switch/hpe_slingshot` - Fix VNI range not updating on slurmctld restart
    or reconfigure.
  * Fix steps not being created when using certain combinations of `-c` and
    `-n` lower than the job's requested resources, when using stepmgr and nodes
    are configured with `CPUs == Sockets*CoresPerSocket`.
  * Permit configuring the number of retry attempts to destroy CXI service via
    the new `destroy_retries` `SwitchParameters` option.
  * Do not reset `memory.high` and `memory.swap.max` in slurmd startup or
    reconfigure, since `slurmd` never actually modifies these values.
  * Fix reconfigure failure of slurmd when it has been started manually and the
    `CoreSpecLimits` have been removed from `slurm.conf`.
  * Set or reset CoreSpec limits when slurmd is reconfigured and it was started
    with systemd.
  * `switch/hpe_slingshot` - Make sure the slurmctld can free step VNIs after
    the controller restarts or reconfigures while the job is running.
  * Fix backup `slurmctld` failure on the second takeover.
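
As referenced in the `+inline_enums` item above, data_parser flags are
appended to the plugin version with `+`. A minimal sketch of inspecting the
effect (the socket path is illustrative; authentication follows your site's
slurmrestd setup):

    # Start slurmrestd with the v0.0.42 data_parser and the +inline_enums flag
    slurmrestd -d v0.0.42+inline_enums unix:/tmp/slurmrestd.sock &

    # Fetch the generated OpenAPI specification; with the flag active, enum
    # arrays are dumped inline rather than referenced via $ref schemas
    curl --unix-socket /tmp/slurmrestd.sock 'http://localhost/openapi/v3'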

  * Changes from version 24.11.4

  * `slurmctld`, `slurmrestd` - Avoid possible race condition that could have
    caused the process to crash when a listener socket was closed while
    accepting a new connection.

  * `slurmrestd` - Avoid race condition that could have resulted in the
    address logged for a UNIX socket being incorrect.
  * `slurmrestd` - Fix parameters in OpenAPI specification for the following
    endpoints to have a `job_id` field:
    `GET /slurm/v0.0.40/jobs/state/`
    `GET /slurm/v0.0.41/jobs/state/`
    `GET /slurm/v0.0.42/jobs/state/`
    `GET /slurm/v0.0.43/jobs/state/`
  * `slurmd` - Fix tracking of thread counts that could cause incoming
    connections to be ignored after a burst of simultaneous incoming
    connections triggers the delayed response logic.
  * Avoid unnecessary `SRUN_TIMEOUT` forwarding to `stepmgr`.
  * Fix jobs being scheduled on higher-weighted powered-down nodes.
  * Fix how backfill scheduler filters nodes from the available nodes based on
    exclusive user and `mcs_label` requirements.
  * `acct_gather_energy/{gpu,ipmi}` - Fix potential energy consumption
    adjustment calculation underflow.
  * `acct_gather_energy/ipmi` - Fix regression introduced in 24.05.5 (which
    introduced the new way of preserving energy measurements through slurmd
    restarts) when `EnergyIPMICalcAdjustment=yes`.
  * Prevent `slurmctld` deadlock in the assoc mgr.
  * Fix memory leak when `RestrictedCoresPerGPU` is enabled.
  * Fix preemptor jobs not entering execution due to incorrect calculation of
    accounting policy limits.
  * Fix certain job requests that were incorrectly denied with a node
    configuration unavailable error.
  * `slurmd` - Avoid crash when slurmd has a communications failure with
    `slurmstepd`.
  * Fix memory leak when parsing YAML input.
  * Prevent `slurmctld` from showing error message about `PreemptMode=GANG`
    being a cluster-wide option for `scontrol update part` calls that don't
    attempt to modify partition PreemptMode.
  * Fix setting `GANG` preemption on partition when updating `PreemptMode` with
    `scontrol`.
  * Fix `CoreSpec` and `MemSpec` limits not being removed from previously
    configured slurmd.
  * Avoid race condition that could lead to a deadlock when `slurmd`,
    `slurmstepd`, `slurmctld`, `slurmrestd` or `sackd` have a fatal event.
  * Fix jobs using `--ntasks-per-node` and `--mem` remaining pending forever
    when the requested memory divided by the number of CPUs surpasses the
    configured `MaxMemPerCPU`.
  * `slurmd` - Fix the address logged upon a new incoming RPC connection to
    show the IP address instead of `INVALID`.
  * Fix memory leak when retrieving reservations. This affects `scontrol`,
    `sinfo`, `sview`, and the following `slurmrestd` endpoints:
    `GET /slurm/{any_data_parser}/reservation/{reservation_name}`
    `GET /slurm/{any_data_parser}/reservations`
  * Log a warning instead of a `DebugFlags=conmgr` gated log entry when
    deferring new incoming connections because the number of active
    connections exceeds `conmgr_max_connections`.
  * Avoid race condition that could result in the worker thread pool not
    activating all threads at once after a reconfigure, resulting in lower
    utilization of available CPU threads until enough internal activity wakes
    up all threads in the worker pool.
  * Avoid theoretical race condition that could result in new incoming RPC
    socket connections being ignored after reconfigure.
  * `slurmd` - Avoid race condition that could result in a state where new
    incoming RPC connections will always be ignored.
  * Add `ReconfigFlags=KeepNodeStateFuture` to restore saved `FUTURE` node
    state on restart and reconfig instead of reverting to `FUTURE` state. This
    will be made the default in 25.05.
  * Fix case where hetjob submit would cause `slurmctld` to crash.
  * Fix jobs using `--cpus-per-gpu` and `--mem` remaining pending forever when
    the requested memory divided by the number of CPUs surpasses the configured
    `MaxMemPerCPU` (a worked example follows this list).
  * Enforce that jobs using `--mem` and several `--*-per-*` options do not
    violate the `MaxMemPerCPU` in place.
  * `slurmctld` - Fix use-cases of jobs incorrectly pending held when
    `--prefer` features are not initially satisfied.
  * `slurmctld` - Fix jobs incorrectly held when `--prefer` not satisfied in
    some use-cases.
  * Ensure `RestrictedCoresPerGPU` and `CoreSpecCount` don't overlap.
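
To illustrate the `MaxMemPerCPU` interaction fixed above: assuming a
hypothetical `MaxMemPerCPU=4096` (4 GB per CPU) in `slurm.conf`, the following
submission requests 8 GB per CPU and was previously left pending forever:

    # 32 GB across 4 tasks (1 CPU each) = 8 GB per CPU > MaxMemPerCPU;
    # before the fix such a job could stay pending indefinitely instead of
    # having its CPU count adjusted or being cleanly rejected
    sbatch --nodes=1 --ntasks-per-node=4 --mem=32G --wrap="srun hostname"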

  * Changes from version 24.11.3

  * Fix database cluster ID generation not being random.

  * Fix a regression in which `slurmd -G` gave no output.
  * Fix a long-standing crash in `slurmctld` after updating a reservation with
    an empty nodelist. The crash could occur after restarting slurmctld, or if
    downing/draining a node in the reservation with the `REPLACE` or
    `REPLACE_DOWN` flag.
  * Avoid changing process name to "`watch`" from the original daemon name.
    This could potentially break some monitoring scripts.
  * Avoid `slurmctld` being killed by `SIGALRM` due to race condition at
    startup.
  * Fix race condition in slurmrestd that resulted in "`Requested data_parser
    plugin does not support OpenAPI plugin`" error being returned for valid
    endpoints.
  * Fix race between `task/cgroup` cpuset and `jobacct_gather/cgroup`. The
    former was removing the pid from the `task_X` cgroup directory, causing
    memory limits to not be applied.
  * If multiple partitions are requested, set the `SLURM_JOB_PARTITION` output
    environment variable to the partition in which the job is running for
    `salloc` and `srun` in order to match the documentation and the behavior of
    `sbatch`.
  * `srun` - Fix incorrectly constructed `SLURM_CPU_BIND` environment variable
    that could get propagated to nested `srun` calls in certain MPI
    environments, causing launch failures.
  * Don't print misleading errors for stepmgr enabled steps.
  * `slurmrestd` - Avoid connection to slurmdbd for the following endpoints:
    `GET /slurm/v0.0.41/jobs`
    `GET /slurm/v0.0.41/job/{job_id}`
  * `slurmrestd` - Avoid connection to slurmdbd for the following endpoints:
    `GET /slurm/v0.0.40/jobs`
    `GET /slurm/v0.0.40/job/{job_id}`
  * `slurmrestd` - Fix possible memory leak when parsing arrays with
    `data_parser/v0.0.40`.
  * `slurmrestd` - Fix possible memory leak when parsing arrays with
    `data_parser/v0.0.41`.
  * `slurmrestd` - Fix possible memory leak when parsing arrays with
    `data_parser/v0.0.42`.

  * Changes from version 24.11.2

  * Fix segfault when submitting `--test-only` jobs that can preempt.

  * Fix regression introduced in 23.11 that prevented the following flags from
    being added to a reservation on an update: `DAILY`, `HOURLY`, `WEEKLY`,
    `WEEKDAY`, and `WEEKEND`.
  * Fix crash and issues when evaluating a job's suitability to run on nodes
    that already have suspended jobs.
  * `slurmctld` will ensure that healthy nodes are not reported as
    `UnavailableNodes` in job reason codes.
  * Fix handling of jobs submitted to a current reservation with flags
    `OVERLAP,FLEX` or `OVERLAP,ANY_NODES` when it overlaps nodes with a future
    maintenance reservation. When a job submission had a time limit that
    overlapped with the future maintenance reservation, it was rejected. Now the
    job is accepted but stays pending with the reason "`ReqNodeNotAvail,
    Reserved for maintenance`".
  * `pam_slurm_adopt` - Avoid errors when explicitly setting some arguments to
    the default value.
  * Fix QOS preemption with `PreemptMode=SUSPEND`.
  * `slurmdbd` - When changing a user's name, update lineage at the same time.
  * Fix regression in 24.11 in which `burst_buffer.lua` does not inherit the
    `SLURM_CONF` environment variable from `slurmctld` and fails to run if
    slurm.conf is in a non-standard location.
  * Fix memory leak in slurmctld if `select/linear` and the
    `PreemptParameters=reclaim_licenses` options are both set in `slurm.conf`.
    Regression in 24.11.1.
  * Fix running jobs that requested multiple partitions from potentially being
    set to the wrong partition on restart.
  * `switch/hpe_slingshot` - Fix compatibility with newer cxi drivers,
    specifically when specifying `disable_rdzv_get`.
  * Add `ABORT_ON_FATAL` environment variable to capture a backtrace from any
    `fatal()` message (see the sketch after this list).
  * Fix printing invalid address in rate limiting log statement.
  * `sched/backfill` - Fix node state `PLANNED` not being cleared from fully
    allocated nodes during a backfill cycle.
  * `select/cons_tres` - Fix future planning of jobs with `bf_licenses`.
  * Prevent redundant "`on_data returned rc: Rate limit exceeded, please retry
    momentarily`" error message from being printed in slurmctld logs.
  * Fix loading non-default QOS on pending jobs from pre-24.11 state.
  * Fix pending jobs displaying `QOS=(null)` when not explicitly requesting a
    QOS.
  * Fix segfault issue from job record with no `job_resrcs`.
  * Fix failing `sacctmgr delete/modify/show` account operations with `where`
    clauses.
  * Fix regression in 24.11 in which Slurm daemons started catching and
    ignoring the `SIGTSTP`, `SIGTTIN` and `SIGUSR1` signals, which they
    previously did not ignore. This also caused slurmctld to be unable to shut
    down after a `SIGTSTP`, because slurmscriptd caught the signal and stopped
    while slurmctld ignored it. Unify and fix these situations, restoring the
    previous behavior for these signals.
  * Document that `SIGQUIT` is no longer ignored by `slurmctld`, `slurmdbd`,
    and `slurmd` in 24.11. As of 24.11.0rc1, `SIGQUIT` is identical to
    `SIGINT` and `SIGTERM` for these daemons, but this change was not
    documented.
  * Fix the scheduler not considering nodes marked for reboot without the
    `ASAP` flag.
  * Remove the `boot^` state on unexpected node reboot after return to service.
  * Do not allow new jobs to start on a node which is being rebooted with the
    flag `nextstate=resume`.
  * Prevent a lower priority job from running after an ASAP reboot is
    cancelled.
  * Fix `srun` jobs starting on nodes rebooting with `nextstate=resume`.
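
For the `ABORT_ON_FATAL` variable added in 24.11.2 (see the item above), a
minimal sketch of capturing a backtrace from a daemon run in the foreground;
the value `1` is an assumption, as the advisory only names the variable:

    # Allow core dumps, then run slurmctld in the foreground; with
    # ABORT_ON_FATAL set, any fatal() message aborts and leaves a core dump
    # from which a backtrace can be extracted (e.g. with gdb)
    ulimit -c unlimited
    ABORT_ON_FATAL=1 slurmctld -D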

## Patch Instructions:

To install this SUSE update use the SUSE recommended installation methods like
YaST online_update or "zypper patch".  
Alternatively you can run the command listed for your product:

  * HPC Module 12  
    zypper in -t patch SUSE-SLE-Module-HPC-12-2025-1757=1
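
    To review the patch metadata (CVE references, affected packages) before
    applying it:  
    zypper info -t patch SUSE-SLE-Module-HPC-12-2025-1757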

## Package List:

  * HPC Module 12 (aarch64 x86_64)
    * slurm_24_11-24.11.5-3.8.1
    * slurm_24_11-torque-debuginfo-24.11.5-3.8.1
    * slurm_24_11-munge-debuginfo-24.11.5-3.8.1
    * slurm_24_11-node-24.11.5-3.8.1
    * slurm_24_11-auth-none-debuginfo-24.11.5-3.8.1
    * slurm_24_11-node-debuginfo-24.11.5-3.8.1
    * slurm_24_11-pam_slurm-24.11.5-3.8.1
    * libnss_slurm2_24_11-debuginfo-24.11.5-3.8.1
    * slurm_24_11-sview-24.11.5-3.8.1
    * slurm_24_11-lua-debuginfo-24.11.5-3.8.1
    * libnss_slurm2_24_11-24.11.5-3.8.1
    * slurm_24_11-devel-24.11.5-3.8.1
    * slurm_24_11-slurmdbd-debuginfo-24.11.5-3.8.1
    * slurm_24_11-torque-24.11.5-3.8.1
    * slurm_24_11-munge-24.11.5-3.8.1
    * slurm_24_11-sview-debuginfo-24.11.5-3.8.1
    * slurm_24_11-plugins-debuginfo-24.11.5-3.8.1
    * slurm_24_11-sql-24.11.5-3.8.1
    * libpmi0_24_11-24.11.5-3.8.1
    * slurm_24_11-slurmdbd-24.11.5-3.8.1
    * perl-slurm_24_11-24.11.5-3.8.1
    * slurm_24_11-debuginfo-24.11.5-3.8.1
    * libpmi0_24_11-debuginfo-24.11.5-3.8.1
    * slurm_24_11-auth-none-24.11.5-3.8.1
    * slurm_24_11-cray-24.11.5-3.8.1
    * libslurm42-debuginfo-24.11.5-3.8.1
    * slurm_24_11-plugins-24.11.5-3.8.1
    * slurm_24_11-sql-debuginfo-24.11.5-3.8.1
    * slurm_24_11-lua-24.11.5-3.8.1
    * slurm_24_11-pam_slurm-debuginfo-24.11.5-3.8.1
    * perl-slurm_24_11-debuginfo-24.11.5-3.8.1
    * libslurm42-24.11.5-3.8.1
  * HPC Module 12 (noarch)
    * slurm_24_11-doc-24.11.5-3.8.1
    * slurm_24_11-webdoc-24.11.5-3.8.1
    * slurm_24_11-config-man-24.11.5-3.8.1
    * slurm_24_11-config-24.11.5-3.8.1

## References:

  * https://www.suse.com/security/cve/CVE-2025-43904.html
  * https://bugzilla.suse.com/show_bug.cgi?id=1243666
