With the 21.08.8 release Tim wrote:

> Once all daemons have been upgraded sites are encouraged to add
> "block_null_hash" to CommunicationParameters. That new option provides
> additional protection against a potential exploit.

However, the slurm.conf man page contains no "block_null_hash" value anywhere! I don't dare add an undocumented CommunicationParameters option to slurm.conf for fear that the slurm* daemons will crash; Slurm is very unforgiving about incorrect lines in slurm.conf.

Can you please advise?

Thanks,
Ole
Hi,

Thanks for pointing this out. You can safely use this option. It blocks communication from all old components, which protects against malicious use of credentials obtained from old Slurm daemons.

https://github.com/SchedMD/slurm/blob/3179211727802be5a1411375a80c8ccf2c59a205/src/common/slurm_protocol_api.c#L186-L187

Dominik
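For reference, enabling it is a one-line change in slurm.conf (a sketch; if CommunicationParameters is already set at your site, the options go into one comma-separated list):

```
# slurm.conf -- sketch; merge with any existing CommunicationParameters
# options as a comma-separated list
CommunicationParameters=block_null_hash
```

The change is then propagated with "scontrol reconfig" on the controller.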
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #1)
> Thanks for pointing this out.
>
> You can safely use this option. It blocks communication from all old
> components, which protects against malicious use of credentials obtained
> from old Slurm daemons.

Thanks, that sounds good. Can you please ensure that "block_null_hash" gets documented in the slurm.conf manual page?

Thanks,
Ole
After setting CommunicationParameters=block_null_hash in slurm.conf and doing scontrol reconfig, some nodes currently running jobs have RPC errors in slurmd.log, here node a001:

[2022-05-05T13:08:02.735] [4970438.2] error: Rank 0 failed sending step completion message directly to slurmctld, retrying
[2022-05-05T13:08:03.244] [4970438.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-05-05T13:08:03.256] [4970438.batch] Retrying job complete RPC for StepId=4970438.batch
[2022-05-05T13:08:18.269] [4970438.batch] Retrying job complete RPC for StepId=4970438.batch
[2022-05-05T13:08:33.282] [4970438.batch] Retrying job complete RPC for StepId=4970438.batch

The node a001 had these Slurm processes from a running job:

[root@a001 ~]# ps auxw | grep slurm
root       78976  0.0  0.0  278748  3440 ?  Sl  01:01  0:01 slurmstepd: [4970438.extern]
root       78984  0.0  0.0  268732  4156 ?  Sl  01:01  0:01 slurmstepd: [4970438.batch]
root       80831  0.0  0.0  406140  4288 ?  Sl  01:03  0:05 slurmstepd: [4970438.2]
root      154625  0.0  0.0 3469068  7212 ?  Ss  08:19  0:04 /usr/sbin/slurmd -D -s

So I have removed CommunicationParameters=block_null_hash again. Now jobs can apparently complete without errors.
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #3)
> After setting CommunicationParameters=block_null_hash in slurm.conf and
> doing scontrol reconfig, some nodes currently running jobs have RPC errors
> in slurmd.log, here node a001:
>
> [2022-05-05T13:08:02.735] [4970438.2] error: Rank 0 failed sending step completion message directly to slurmctld, retrying
> [2022-05-05T13:08:03.244] [4970438.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
> [2022-05-05T13:08:03.256] [4970438.batch] Retrying job complete RPC for StepId=4970438.batch
> [2022-05-05T13:08:18.269] [4970438.batch] Retrying job complete RPC for StepId=4970438.batch
> [2022-05-05T13:08:33.282] [4970438.batch] Retrying job complete RPC for StepId=4970438.batch
>
> The node a001 had these Slurm processes from a running job:
>
> [root@a001 ~]# ps auxw | grep slurm
> root       78976  0.0  0.0  278748  3440 ?  Sl  01:01  0:01 slurmstepd: [4970438.extern]
> root       78984  0.0  0.0  268732  4156 ?  Sl  01:01  0:01 slurmstepd: [4970438.batch]
> root       80831  0.0  0.0  406140  4288 ?  Sl  01:03  0:05 slurmstepd: [4970438.2]
> root      154625  0.0  0.0 3469068  7212 ?  Ss  08:19  0:04 /usr/sbin/slurmd -D -s
>
> So I have removed the CommunicationParameters=block_null_hash again. Now
> jobs can apparently complete without errors.

Furthermore, slurmctld.log also shows errors related to node a001:

[2022-05-05T13:08:33.272] error: slurm_unpack_received_msg: Header lengths are longer than data received
[2022-05-05T13:08:33.282] error: slurm_receive_msg [10.2.129.1:52624]: Unspecified error
Hi,

Can you check whether slurmd on node a001 has been updated? Is any job that started before the update still running on this node?

Dominik
(In reply to Dominik Bartkiewicz from comment #5)
> Can you check if slurmd on node a001 is updated?

Yes, all nodes were updated this morning. The node a001 slurmd.log documents this:

[2022-05-05T08:19:48.295] Slurmd shutdown completing
[2022-05-05T08:19:48.427] Considering each NUMA node as a socket
[2022-05-05T08:19:48.427] Considering each NUMA node as a socket
[2022-05-05T08:19:48.429] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2022-05-05T08:19:48.433] slurmd version 21.08.8 started
[2022-05-05T08:19:48.433] slurmd started on Thu, 05 May 2022 08:19:48 +0200
[2022-05-05T08:19:49.706] CPUs=40 Boards=1 Sockets=4 Cores=10 Threads=1 Memory=385380 TmpDisk=145069 Uptime=688005 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2022-05-05T13:04:54.758] Considering each NUMA node as a socket
[2022-05-05T13:04:54.760] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2022-05-05T13:08:02.735] [4970438.2] error: Rank 0 failed sending step completion message directly to slurmctld, retrying
[2022-05-05T13:08:03.244] [4970438.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-05-05T13:08:03.256] [4970438.batch] Retrying job complete RPC for StepId=4970438.batch
[2022-05-05T13:08:18.269] [4970438.batch] Retrying job complete RPC for StepId=4970438.batch
[2022-05-05T13:08:33.282] [4970438.batch] Retrying job complete RPC for StepId=4970438.batch

> Is any job started before the update still running on this node?

Of course! All our nodes were running jobs at the time I upgraded them to 21.08.8. Are you hinting that CommunicationParameters=block_null_hash must only be enabled while the entire cluster is down?

Thanks,
Ole
As I mentioned, this option blocks communication from all unpatched components (e.g. slurmstepd). Slurmstepd tries to send REQUEST_STEP_COMPLETE/REQUEST_COMPLETE_BATCH_SCRIPT to slurmctld when a step finishes.
(In reply to Dominik Bartkiewicz from comment #7)
> As I mentioned, this option blocks communication from all unpatched
> components (e.g. slurmstepd). Slurmstepd tries to send
> REQUEST_STEP_COMPLETE/REQUEST_COMPLETE_BATCH_SCRIPT to slurmctld when a
> step is finished.

Yes, this makes sense and probably explains the errors we are seeing. The question remains: what is a safe, working procedure for enabling CommunicationParameters=block_null_hash in slurm.conf? IMHO, such a procedure ought to be documented by SchedMD.

Thanks,
Ole
This is why the option is not enabled by default. The procedure is: update all daemons, wait until all steps started before the update have finished, and then enable the option.

I am checking whether we can make this option less restrictive and block all RPCs except those sent from slurmstepd.
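The "wait until old steps end" check could be sketched as a small shell helper (hypothetical, not an official SchedMD tool): it reads "jobid start-epoch" pairs on stdin, compares each start time against the daemon-upgrade time, and reports whether any running job predates the upgrade, i.e. whether an unpatched slurmstepd may still be alive.

```shell
# safe_to_enable -- hypothetical helper sketch, not an official SchedMD tool.
# Reads "jobid start_epoch" lines on stdin and compares each job's start
# time (epoch seconds) against the daemon-upgrade time given as $1.
# Returns 0 when no running job predates the upgrade -- i.e. no unpatched
# slurmstepd should remain, so it ought to be safe to add
# CommunicationParameters=block_null_hash and run "scontrol reconfig".
safe_to_enable() {
    upgrade_epoch="$1"
    unsafe=0
    while read -r jobid start_epoch; do
        [ -n "$jobid" ] || continue
        if [ "$start_epoch" -lt "$upgrade_epoch" ]; then
            echo "job $jobid started before the upgrade; its slurmstepd is unpatched"
            unsafe=1
        fi
    done
    [ "$unsafe" -eq 0 ] && echo "safe: no running job predates the upgrade"
    return "$unsafe"
}
```

On a live cluster one could feed it with squeue output, e.g. `squeue --noheader -o "%i %S"`, after converting the %S start times to epoch seconds (the exact conversion depends on site tooling).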
Hi,

We have added documentation for block_null_hash:
https://slurm.schedmd.com/slurm.conf.html#OPT_block_null_hash

Dominik
Hi,

Let me know if there is anything else I can do to help, or if this ticket is OK to close.

Dominik
Hi Dominik,

(In reply to Dominik Bartkiewicz from comment #14)
> Let me know if there is anything else I can do to help, or if this ticket
> is OK to close.

Thanks for the update. Please close this ticket.

Best regards,
Ole