<div class="container">
<h1>Recommended update for slurm_23_02</h1>
<table class="table table-striped table-bordered">
<tbody>
<tr>
<th>Announcement ID:</th>
<td>SUSE-RU-2023:4336-1</td>
</tr>
<tr>
<th>Rating:</th>
<td>moderate</td>
</tr>
<tr>
<th>References:</th>
<td>
<ul>
<li style="display: inline;">
<a href="https://bugzilla.suse.com/show_bug.cgi?id=1215437">bsc#1215437</a>
</li>
</ul>
</td>
</tr>
<tr>
<th>Affected Products:</th>
<td>
<ul class="list-group">
<li class="list-group-item">HPC Module 12</li>
<li class="list-group-item">SUSE Linux Enterprise High Performance Computing 12 SP2</li>
<li class="list-group-item">SUSE Linux Enterprise High Performance Computing 12 SP3</li>
<li class="list-group-item">SUSE Linux Enterprise High Performance Computing 12 SP4</li>
<li class="list-group-item">SUSE Linux Enterprise High Performance Computing 12 SP5</li>
<li class="list-group-item">SUSE Linux Enterprise Server 12 SP2</li>
<li class="list-group-item">SUSE Linux Enterprise Server 12 SP3</li>
<li class="list-group-item">SUSE Linux Enterprise Server 12 SP4</li>
<li class="list-group-item">SUSE Linux Enterprise Server 12 SP5</li>
<li class="list-group-item">SUSE Linux Enterprise Server for SAP Applications 12 SP2</li>
<li class="list-group-item">SUSE Linux Enterprise Server for SAP Applications 12 SP3</li>
<li class="list-group-item">SUSE Linux Enterprise Server for SAP Applications 12 SP4</li>
<li class="list-group-item">SUSE Linux Enterprise Server for SAP Applications 12 SP5</li>
</ul>
</td>
</tr>
</tbody>
</table>
<p>An update that has one fix can now be installed.</p>
<h2>Description:</h2>
<p>This update for slurm_23_02 fixes the following issues:</p>
<ul>
<li>
<p>Updated to version 23.02.5 with the following changes:</p>
</li>
<li>
<p>Bug Fixes:</p>
<ul>
<li>Revert a change in 23.02 where <code>SLURM_NTASKS</code> was no longer set in the
job's environment when <code>--ntasks-per-node</code> was requested.
The method by which it is set, however, is different and should be more
accurate in more situations.</li>
<li>Change pmi2 plugin to honor the <code>SrunPortRange</code> option. This matches the
new behavior of the pmix plugin in 23.02.0. Note that neither of these
plugins makes use of the <code>MpiParams=ports=</code> option, and previously
were only limited by the system's ephemeral port range.</li>
<li>Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
a node features plugin is configured.</li>
<li>Fix and prevent reoccurring reservations from overlapping.</li>
<li><code>job_container/tmpfs</code> - Avoid attempts to share BasePath between nodes.</li>
<li>With <code>CR_Cpu_Memory</code>, fix node selection for jobs that request gres and
<code>--mem-per-cpu</code>.</li>
<li>Fix a regression from 22.05.7 in which some jobs were allocated too few
nodes, thus overcommitting cpus to some tasks.</li>
<li>Fix a job being stuck in the completing state if the job ends while the
primary controller is down or unresponsive and the backup controller has
not yet taken over.</li>
<li>Fix <code>slurmctld</code> segfault when a node registers with a configured
<code>CpuSpecList</code> while <code>slurmctld</code> configuration has the node without
<code>CpuSpecList</code>.</li>
<li>Fix cloud nodes getting stuck in <code>POWERED_DOWN+NO_RESPOND</code> state after
not registering by <code>ResumeTimeout</code>.</li>
<li><code>slurmstepd</code> - Avoid cleanup of <code>config.json</code>-less containers' spooldir
getting skipped.</li>
<li>Fix scontrol segfault when 'completing' command requested repeatedly in
interactive mode.</li>
<li>Properly handle a race condition between <code>bind()</code> and <code>listen()</code> calls
in the network stack when running with SrunPortRange set.</li>
<li>Federation - Fix revoked jobs being returned regardless of the
<code>-a</code>/<code>--all</code> option for privileged users.</li>
<li>Federation - Fix canceling pending federated jobs from non-origin
clusters which could leave federated jobs orphaned from the origin
cluster.</li>
<li>Fix sinfo segfault when printing multiple clusters with <code>--noheader</code>
option.</li>
<li>Federation - fix clusters not syncing if clusters are added to a
federation before they have registered with the dbd.</li>
<li><code>node_features/helpers</code> - Fix node selection for jobs requesting
changeable features with the <code>|</code> operator, which could prevent jobs from
running on some valid nodes.</li>
<li><code>node_features/helpers</code> - Fix inconsistent handling of <code>&</code> and <code>|</code>,
where an AND'd feature was sometimes AND'd to all sets of features
instead of just the current set. E.g. <code>foo|bar&baz</code> was interpreted
as <code>{foo,baz}</code> or <code>{bar,baz}</code> instead of how it is documented:
<code>{foo} or {bar,baz}</code>.</li>
<li>Fix job accounting so that when a job is requeued its allocated node
count is cleared. After the requeue, sacct will correctly show that
the job has 0 <code>AllocNodes</code> while it is pending or if it is canceled
before restarting.</li>
<li><code>sacct</code> - <code>AllocCPUS</code> now correctly shows 0 if a job has not yet
received an allocation or if the job was canceled before getting one.</li>
<li>Fix Intel oneAPI autodetect: detect the <code>/dev/dri/renderD[0-9]+</code> GPUs,
and do not detect <code>/dev/dri/card[0-9]+</code>.</li>
<li>Fix node selection for jobs that request <code>--gpus</code> and a number of
tasks fewer than GPUs, which resulted in incorrectly rejecting these
jobs.</li>
<li>Remove <code>MYSQL_OPT_RECONNECT</code> completely.</li>
<li>Fix cloud nodes in <code>POWERING_UP</code> state disappearing (getting set
to <code>FUTURE</code>) when an <code>scontrol reconfigure</code> happens.</li>
<li><code>openapi/dbv0.0.39</code> - Avoid assert / segfault on missing coordinators
list.</li>
<li><code>slurmrestd</code> - Correct memory leak while parsing OpenAPI specification
templates with server overrides.</li>
<li>Fix overwriting user node reason with system message.</li>
<li>Prevent deadlock when <code>rpc_queue</code> is enabled.</li>
<li><code>slurmrestd</code> - Correct OpenAPI specification generation bug where
fields with overlapping parent paths would not get generated.</li>
<li>Fix memory leak as a result of a partition info query.</li>
<li>Fix memory leak as a result of a job info query.</li>
<li>For step allocations, fix <code>--gres=none</code> sometimes not ignoring gres
from the job.</li>
<li>Fix <code>--exclusive</code> jobs incorrectly gang-scheduling where they shouldn't.</li>
<li>Fix allocations with <code>CR_SOCKET</code>, gres not assigned to a specific
socket, and block core distribution potentially allocating more sockets
than required.</li>
<li>Revert a change in 23.02.3 where Slurm would kill a script's process
group as soon as the script ended instead of waiting as long as any
process in that process group held the stdout/stderr file descriptors
open. That change broke some scripts that relied on the previous
behavior. Setting time limits for scripts (such as
<code>PrologEpilogTimeout</code>) is strongly encouraged to avoid Slurm waiting
indefinitely for scripts to finish.</li>
<li>Fix <code>slurmdbd -R</code> not returning an error under certain conditions.</li>
<li><code>slurmdbd</code> - Avoid potential NULL pointer dereference in the mysql
plugin.</li>
<li>Fix regression in 23.02.3 which broke X11 forwarding for hosts when
MUNGE sends a localhost address in the encode host field. This is caused
when the node hostname is mapped to 127.0.0.1 (or similar) in
<code>/etc/hosts</code>.</li>
<li><code>openapi/[db]v0.0.39</code> - fix memory leak on parsing error.</li>
<li><code>data_parser/v0.0.39</code> - fix updating qos for associations.</li>
<li><code>openapi/dbv0.0.39</code> - fix updating values for associations with null
users.</li>
<li>Fix minor memory leak with <code>--tres-per-task</code> and licenses.</li>
<li>Fix cyclic socket cpu distribution for tasks in a step where
<code>--cpus-per-task</code> is less than the usable threads per core.</li>
<li><code>slurmrestd</code> - For <code>GET /slurm/v0.0.39/node[s]</code>, change format of
node's energy field <code>current_watts</code> to a dictionary to account for
unset value instead of dumping 4294967294.</li>
<li><code>slurmrestd</code> - For <code>GET /slurm/v0.0.39/qos</code>, change format of QOS's
field "priority" to a dictionary to account for unset value instead of
dumping 4294967294.</li>
<li><code>slurmrestd</code> - For <code>GET /slurm/v0.0.39/job[s]</code>, the 'return code'
field in <code>v0.0.39_job_exit_code</code> will be set to -127 instead of
being left unset when the job does not have a relevant return code.</li>
</ul>
</li>
<li>
<p>Other Changes:</p>
<ul>
<li>Remove the <code>--uid</code> / <code>--gid</code> options from the <code>salloc</code> and <code>srun</code> commands. These options
did not work correctly since the CVE-2022-29500 fix in combination with
some changes made in 23.02.0.</li>
<li>Add the <code>JobId</code> to <code>debug()</code> messages indicating when
<code>cpus_per_task/mem_per_cpu</code> or <code>pn_min_cpus</code> are being automatically
adjusted.</li>
<li>Change the log message warning for rate limited users from verbose to
info.</li>
<li><code>slurmstepd</code> - Clean up the per-task generated environment for containers in
spooldir.</li>
<li>Format batch, extern, interactive, and pending step ids into strings that
are human readable.</li>
<li><code>slurmrestd</code> - Reduce memory usage when printing out job CPU frequency.</li>
<li><code>data_parser/v0.0.39</code> - Add <code>required/memory_per_cpu</code> and
<code>required/memory_per_node</code> to <code>sacct --json</code> and <code>sacct --yaml</code> and
<code>GET /slurmdb/v0.0.39/jobs</code> from slurmrestd.</li>
<li><code>gpu/oneapi</code> - Store cores correctly so CPU affinity is tracked.</li>
<li>Allow <code>slurmdbd -R</code> to work if the root assoc id is not 1.</li>
<li>Limit periodic node registrations to 50 instead of the full <code>TreeWidth</code>.
Since unresolvable <code>cloud/dynamic</code> nodes must disable fanout by setting
<code>TreeWidth</code> to a large number, the previous behavior would cause all nodes
to register at once.</li>
</ul>
</li>
</ul>
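<p>The documented precedence for feature expressions in the <code>node_features/helpers</code> fix above (<code>&amp;</code> binds tighter than <code>|</code>) can be illustrated with a minimal Python sketch. This is not Slurm's implementation: the function names <code>parse_feature_sets</code> and <code>node_matches</code> are hypothetical, and the sketch assumes plain feature names with no parentheses, counts, or other operators.</p>

```python
def parse_feature_sets(expr: str) -> list[set[str]]:
    """Split on '|' first, then on '&', so '&' binds tighter than '|'.

    Illustrative only: handles plain feature names, nothing else.
    """
    feature_sets = []
    for alternative in expr.split("|"):
        # Each '|'-separated alternative is an independent AND'd set.
        features = {name.strip() for name in alternative.split("&") if name.strip()}
        if features:
            feature_sets.append(features)
    return feature_sets


def node_matches(expr: str, node_features: set[str]) -> bool:
    """A node qualifies if it offers every feature of at least one set."""
    return any(
        required.issubset(node_features) for required in parse_feature_sets(expr)
    )


if __name__ == "__main__":
    print(parse_feature_sets("foo|bar&baz"))
    print(node_matches("foo|bar&baz", {"foo"}))
```

<p>Under this reading, a node that offers only <code>foo</code> satisfies <code>foo|bar&amp;baz</code>, whereas the earlier inconsistent handling could also require <code>baz</code> on that node.</p>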
<h2>Patch Instructions:</h2>
<p>
To install this SUSE update, use the SUSE recommended
installation methods such as YaST online_update or "zypper patch".<br/>
Alternatively you can run the command listed for your product:
</p>
<ul class="list-group">
<li class="list-group-item">
HPC Module 12
<br/>
<code>zypper in -t patch SUSE-SLE-Module-HPC-12-2023-4336=1</code>
</li>
</ul>
<h2>Package List:</h2>
<ul>
<li>
HPC Module 12 (aarch64 x86_64)
<ul>
<li>slurm_23_02-node-23.02.5-3.10.6</li>
<li>libslurm39-23.02.5-3.10.6</li>
<li>slurm_23_02-lua-23.02.5-3.10.6</li>
<li>slurm_23_02-pam_slurm-23.02.5-3.10.6</li>
<li>perl-slurm_23_02-23.02.5-3.10.6</li>
<li>libslurm39-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-sql-23.02.5-3.10.6</li>
<li>libnss_slurm2_23_02-23.02.5-3.10.6</li>
<li>slurm_23_02-slurmdbd-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-sview-23.02.5-3.10.6</li>
<li>slurm_23_02-devel-23.02.5-3.10.6</li>
<li>slurm_23_02-cray-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-node-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-plugins-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-auth-none-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-23.02.5-3.10.6</li>
<li>slurm_23_02-munge-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-plugin-ext-sensors-rrd-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-slurmdbd-23.02.5-3.10.6</li>
<li>slurm_23_02-plugin-ext-sensors-rrd-23.02.5-3.10.6</li>
<li>slurm_23_02-cray-23.02.5-3.10.6</li>
<li>libpmi0_23_02-23.02.5-3.10.6</li>
<li>slurm_23_02-torque-23.02.5-3.10.6</li>
<li>libnss_slurm2_23_02-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-lua-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-sql-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-pam_slurm-debuginfo-23.02.5-3.10.6</li>
<li>libpmi0_23_02-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-auth-none-23.02.5-3.10.6</li>
<li>slurm_23_02-plugins-23.02.5-3.10.6</li>
<li>perl-slurm_23_02-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-debugsource-23.02.5-3.10.6</li>
<li>slurm_23_02-sview-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-torque-debuginfo-23.02.5-3.10.6</li>
<li>slurm_23_02-munge-23.02.5-3.10.6</li>
</ul>
</li>
<li>
HPC Module 12 (noarch)
<ul>
<li>slurm_23_02-webdoc-23.02.5-3.10.6</li>
<li>slurm_23_02-config-man-23.02.5-3.10.6</li>
<li>slurm_23_02-config-23.02.5-3.10.6</li>
<li>slurm_23_02-doc-23.02.5-3.10.6</li>
</ul>
</li>
</ul>
<h2>References:</h2>
<ul>
<li>
<a href="https://bugzilla.suse.com/show_bug.cgi?id=1215437">https://bugzilla.suse.com/show_bug.cgi?id=1215437</a>
</li>
</ul>
</div>