
Estimating job resources

To estimate how many resources to request when submitting a job, you can use the interactive srun command.

In a terminal connected to a login node, launch a 5-core interactive job via:

srun --cpus-per-task=5 --pty /bin/bash
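srun accepts the same resource flags we will use later with sbatch, so you can also reserve memory and walltime for the interactive session. A sketch (the memory and time values here are illustrative, not requirements):

srun --cpus-per-task=5 --mem=8GB --time=01:00:00 --pty /bin/bash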
Once you have landed on your new node, go ahead and activate your newly created conda environment:

conda activate discovery_class

To get started, please copy a small zip archive containing some Python code and a sample submit script.

cp /dartfs-hpc/admin/Class_Examples.zip . && unzip Class_Examples.zip
The above command copies the file Class_Examples.zip from /dartfs-hpc/admin. The . tells cp to place the copy in your current working directory, and && runs the next command, unzip, which extracts the contents into the directory you are in.

When estimating resource needs, you can use a program like top to monitor current utilization.

In this tutorial, let's open two terminals side by side.

In one terminal we will launch our Python code from the folder we unzipped. In the other we will run the command top -u <username> to look at resource utilization.

In your first terminal, take note of the host your srun job landed on. You can see it in the changed prompt:

[john@t04 ~]$
In my case my interactive job landed on t04.

In my second terminal, I'll ssh to t04 directly.

ssh t04

Now we are set up with two terminals side by side on the same host: one running a job, the other a direct ssh session. Within the Class_Examples folder is a basic Python script we will use for estimating resources, called invert_matrix.py. Let's run the script from our first terminal, the one with the interactive job, and see what it does.

cd Class_Examples
time python3 invert_matrix.py

[Screenshot: invert_matrix.py running in the interactive terminal]

Once your Python command is executing as in the screenshot above, use the second terminal you opened to run top. Filtering on your username (mine is john) shows only your own processes:

top -u $(whoami)

You can exit top at any time by pressing the letter q.

The top screen displays the overall state of the system, including the number of CPUs, the amount of system memory, and other useful details. In this case we are looking at two fields in particular: %CPU, and RES, short for resident memory (the physical RAM the process is actually using).

[Screenshot: top output showing the %CPU and RES columns for the running job]
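A representative line of top output for this job (the PID and exact values are illustrative):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1234 john      20   0 1586736 912340  15892 R  99.0  0.7   1:32.45 python3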

In the output above, the %CPU column shows 97-99%, which is equivalent to one CPU. With this information I know to submit my job with 1 CPU in order for it to run efficiently.

In the RES column we can see that we are using not quite a full GB of memory. From this output we know that requesting 8GB (or even lower) will be more than sufficient for our job.
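For a more precise measurement after the fact, many Slurm sites install the seff utility, which reports a completed job's actual CPU and memory efficiency (availability on your cluster is an assumption):

seff <jobid>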

The next resource you should consider estimating before submitting your job is walltime. Walltime determines how long your job is allowed to run. Estimating walltime accurately is good scheduler etiquette.

[Screenshot: time output showing the real, user, and sys fields]
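The output of time has three fields; a representative result for this script (exact values are illustrative):

real    4m32.180s
user    4m28.520s
sys     0m2.310s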

From the output above, look at the real field. This is the elapsed time between pressing the enter key and the termination of the program. At this point, we know that we should submit with at least 5 minutes of walltime, which should allow enough time for the job to run to completion.

Note

Determining walltime can be tricky. To avoid losing a job to its time limit, it is suggested to request 15-20% more walltime than the job typically needs; this ensures the job has enough time to complete. So if your job takes 8 minutes to complete, submit for 10.
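In a submit script, that padded request would look like the following (8-minute job, rounded up to 10 minutes):

#SBATCH --time=00:10:00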

Now that we have all of this information about the job, we are ready to build our first submit script for batch submission to the scheduler.

#!/bin/bash

# Request 1 CPU for the job
#SBATCH --cpus-per-task=1

# Request 8GB of memory for the job
#SBATCH --mem=8GB

# Walltime (job duration)
#SBATCH --time=00:05:00

# Then finally, our code we want to execute.
time python3 invert_matrix.py

Before we move to the next portion of submitting the job via sbatch, let's adjust the script to use 5 cores instead of 1.
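Only two lines need to change: the CPU request and the script being run. A sketch of the updated submit script lines (assuming the rest of the script stays as above):

# Request 5 CPUs for the job
#SBATCH --cpus-per-task=5

time python3 invert_matrix_5_threads.py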

So that you do not have to open an editor, an updated version of the script is in the folder. Go ahead and take a look at the file invert_matrix_5_threads.py:

cat invert_matrix_5_threads.py 
#!/usr/bin/python3
import os

# Set the number of threads to 5 to limit CPU usage to 5 cores
os.environ["OPENBLAS_NUM_THREADS"] = "5"  # For systems using OpenBLAS

# Now import NumPy after setting environment variables
import numpy as np
import sys

# Main computation loop
for i in range(2, 1501):
    x = np.random.rand(i, i)
    y = np.linalg.inv(x)
    z = np.dot(x, y)
    e = np.eye(i)
    r = z - e
    m = r.mean()
    if i % 50 == 0:
        print("i,mean", i, m)
        sys.stdout.flush()
Notice that the line near the top is set to:
os.environ["OPENBLAS_NUM_THREADS"] = "5"
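As the script's comment notes, OPENBLAS_NUM_THREADS only applies to NumPy builds linked against OpenBLAS. If your NumPy uses a different BLAS backend, the equivalent variables can be exported in the submit script instead; setting all of the common ones is a safe approach:

export OPENBLAS_NUM_THREADS=5   # OpenBLAS builds
export MKL_NUM_THREADS=5        # Intel MKL builds
export OMP_NUM_THREADS=5        # OpenMP-based backends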

Let's go ahead and run that script now to see if adding 5 cores speeds it up:

time python3 invert_matrix_5_threads.py

Was it faster?