Submitting a batch job
The next step is to submit this batch job to the cluster.
Within the Class_Examples folder we unzipped, you will find a copy of the submit script we created in the Estimating job resources section.
Here is how it looks:
#!/bin/bash -l
# Request 3 CPUs for the job
#SBATCH --cpus-per-task=3
# Request 1GB of memory for the job
#SBATCH --mem=1GB
# Walltime (job duration)
#SBATCH --time=00:05:00
# Finally, the code we want to execute.
python2 invert_matrix.py
Note
All of the lines that begin with #SBATCH are directives to Slurm. The meaning of each directive in the sample script is explained in the comment line that precedes it. The full list of available directives is explained in the man page for the sbatch command, which is available on Discovery.
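For reference, here is a sketch of two other commonly used directives from the sbatch man page, --job-name and --output; the values shown are only illustrative:
#!/bin/bash -l
# Give the job a descriptive name (shown in the NAME column of squeue)
#SBATCH --job-name=invert_matrix
# Write the job's output to a custom file; %j expands to the job ID
#SBATCH --output=invert_matrix_%j.out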
sbatch will copy the current shell environment and the scheduler will recreate that environment on the allocated compute node when the job starts.
Note
The job script does NOT run .bashrc or .bash_profile, and so may not have the same environment as a fresh login shell. This is important if you use aliases, or the conda system to set up your own custom version of python and sets of python packages. Since conda defines shell functions, it must be configured before you can call, e.g., conda activate my-env. The simplest way to do this is for the first line of your script to be:
#!/bin/bash -l
which explicitly starts bash as a login shell.
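As a concrete illustration, a job script that activates a conda environment might look like the following. This is a minimal sketch: my-env stands in for your own environment name, and it assumes conda has been initialized in your login files so the login shell makes the conda command available.
#!/bin/bash -l
#SBATCH --cpus-per-task=3
#SBATCH --mem=1GB
#SBATCH --time=00:05:00
# The login shell (-l) runs your login files, which define conda's shell functions
conda activate my-env
# Run the code inside the activated environment
python invert_matrix.py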
Now submit the job and check its status. We use sbatch to submit the job:
[john@x01 Class_Examples]$ sbatch python_job.sh
Submitted batch job 2628650
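If you want to capture just the numeric job ID, for example to reuse it later in a script, sbatch has a --parsable option that prints only the ID. A small sketch (the jobid variable name is just for illustration):
[john@x01 Class_Examples]$ jobid=$(sbatch --parsable python_job.sh)
[john@x01 Class_Examples]$ echo "Submitted batch job $jobid"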
To see a quick view of the job status, you can issue the squeue -u command with your username:
[john@x01 Class_Examples]$ squeue -u john
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2628650 standard python_j john R 0:49 1 j03
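squeue also accepts a -j flag if you only want the entry for a specific job rather than all of your jobs, for example:
[john@x01 Class_Examples]$ squeue -j 2628650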
If you would like to see detailed information about the job, you can use the scontrol command. scontrol will show which node the job is running on, how much walltime it has left, the resources that were requested, and more.
[john@x01 Class_Examples]$ scontrol show job 2628650
JobId=2628650 JobName=python_job.sh
UserId=john(48374) GroupId=rc-users(480987) MCS_label=rc
Priority=1 Nice=0 Account=rc QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:03:05 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2022-05-10T20:33:20 EligibleTime=2022-05-10T20:33:20
AccrueTime=2022-05-10T20:33:20
StartTime=2022-05-10T20:34:07 EndTime=2022-05-10T20:39:07 Deadline=N/A
PreemptEligibleTime=2022-05-10T20:34:07 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-10T20:34:07 Scheduler=Main
Partition=standard AllocNode:Sid=x01:36127
ReqNodeList=(null) ExcNodeList=(null)
NodeList=j03
BatchHost=j03
NumNodes=1 NumCPUs=3 NumTasks=1 CPUs/Task=3 ReqB:S:C:T=0:0:*:*
TRES=cpu=3,mem=12G,node=1,billing=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=3 MinMemoryCPU=4G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/dartfs-hpc/rc/home/p/d18014p/Class_Examples/python_job.sh
WorkDir=/dartfs-hpc/rc/home/p/d18014p/Class_Examples
StdErr=/dartfs-hpc/rc/home/p/d18014p/Class_Examples/slurm-2628650.out
StdIn=/dev/null
StdOut=/dartfs-hpc/rc/home/p/d18014p/Class_Examples/slurm-2628650.out
Power=
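Since this output is plain text, you can combine scontrol with standard shell tools to pull out just the fields you care about. For example, to check the job's state and time limit:
[john@x01 Class_Examples]$ scontrol show job 2628650 | grep -E 'JobState|TimeLimit'
JobState=RUNNING Reason=None Dependency=(null)
RunTime=00:03:05 TimeLimit=00:05:00 TimeMin=N/A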
JOBID is the unique ID of the job; in this case it is 2628650. In the example above, we issue scontrol to view information related to our job.
The output file, slurm-2628650.out, consists of three sections:
- A header section, Prologue, which gives information such as JOBID, user name and node list.
- A body section which includes user output to STDOUT.
- A footer section, Epilogue, which is similar to the header.
If we use the cat command (short for concatenate), we can see that the output file is exactly as we would expect it to be:
[john@x01 Class_Examples]$ cat slurm-2628650.out
i,mean 50 -1.09670952875e-17
i,mean 100 3.04038500105e-16
i,mean 150 1.24104227901e-17
i,mean 200 -1.60139176225e-16
i,mean 250 -1.17287454488e-16
i,mean 300 2.94829036507e-16
i,mean 350 4.66888358553e-17
i,mean 400 3.752857595e-15
i,mean 450 2.60083792553e-18
i,mean 500 2.05032635526e-16
i,mean 550 -3.80521832845e-16
i,mean 600 -3.07765049942e-17
i,mean 650 -9.39259624383e-17
i,mean 700 2.81309843854e-16
i,mean 750 -4.91502224946e-17
i,mean 800 7.35744459606e-17
i,mean 850 -5.23231103131e-18
i,mean 900 -5.52926185394e-17
i,mean 950 -3.26360319077e-16
i,mean 1000 -1.39343172417e-17
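If the job is still running, you can also watch new output as it is written with tail -f instead of cat (press Ctrl-C to stop following):
[john@x01 Class_Examples]$ tail -f slurm-2628650.out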
Here are some other useful commands to remember when interfacing with the scheduler.
sbatch: submits a batch job script to the scheduler (e.g. sbatch python_job.sh)
squeue: shows the status of Slurm batch jobs
srun: srun --pty /bin/bash runs an interactive job
sinfo: shows information about partitions
scontrol: scontrol show job shows detailed information about a job
scancel: cancels a job, given its job ID
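For example, to cancel the job submitted earlier you would pass its job ID to scancel; the command prints nothing when the cancellation succeeds:
[john@x01 Class_Examples]$ scancel 2628650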