slurmstepd: error: *** JOB 3616403 ON t03 CANCELLED AT 2025-02-07T10:54:02 ***
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-t03-2140993@0,0] on node t03
  Remote daemon: [prterun-t03-2140993@0,1] on node t04

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------