SLURM Job Requirements

Job Requirements

The most important part of the job submission process, from a performance perspective, is understanding your job's requirements, i.e., run time, memory requirements, disk and I/O requirements, interconnect requirements, etc. Based on this understanding, you need to tell the scheduler what your job needs so that it can run as efficiently as possible.

Job Runtime

All jobs are required to have a hard runtime specification. Jobs that do not include one are given a default runtime of 1 hour and will be stopped when that limit is reached.

Users should ensure that they let the scheduler know the estimated run time of their jobs by including the following option in their submit scripts:

#SBATCH --time=XX:XX:XX

where XX:XX:XX is the expected run time in hours, minutes, and seconds (HH:MM:SS).
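
For example, a minimal submit script requesting four hours of wall time might look like the following sketch (the job name and executable are placeholders for your own):

#!/bin/bash
#SBATCH --job-name=my-analysis    # placeholder job name
#SBATCH --time=04:00:00           # request 4 hours of wall time
#SBATCH --ntasks=1                # a single serial task

./my_program input.dat            # replace with your own executable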

The general rule of thumb is that the shorter your requested runtime, the sooner your job is likely to start.

Determining Job Runtime

Only benchmarking or profiling your code will provide a reasonable time estimate, but there are some rules you can go by when attempting to make an educated guess:

Embarrassingly parallel or coarse-grained tasks tend to scale linearly or near-linearly. Generally, you can divide the time it takes to run on one processor by the number of processors you plan to run on.
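
As a hypothetical illustration of that rule: a workload that takes about 8 hours on a single core might be expected to finish in roughly 30 minutes on 16 cores, and padding the estimate leaves room for startup, I/O, and imperfect scaling:

# Hypothetical estimate: 8 hours serial / 16 tasks is roughly 30 minutes,
# padded to 1 hour to be safe.
#SBATCH --ntasks=16
#SBATCH --time=01:00:00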

Finer-grained computations generally follow some scaling curve where, after some point, adding more resources yields no appreciable speedup. Depending on the parameters passed to the program, there may be no established point of reference for a reasonable runtime estimate. The best thing to do in this case is to benchmark the code with the same input parameters but with fewer repeated iterations, time steps, etc. Get an idea of how the application behaves with smaller, shorter runs, then make a reasonable estimate of how the runtime will change as you increase the iterations, time steps, etc.
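
One way to do this is sketched below, assuming your program exposes something like a --steps option that controls run length (both the program name and the flag are placeholders): time a few short runs and watch how the runtime grows.

#!/bin/bash
# Hypothetical scaling test: time short runs of increasing length
# before committing to the full-size job.
for steps in 100 200 400; do
    echo "== $steps steps =="
    time ./my_simulation --steps "$steps"
done

Comparing the elapsed times shows how runtime grows with problem size, and the trend can be extrapolated, with some padding, to the full-length run.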

Determining Your Job’s Resource Requirements

Many pieces of code available today come with sufficient documentation that explains the requirements of the program given certain input parameters and how those requirements change as the size of your problem changes. You can usually fudge these rough estimates into working resource requests that will, for the most part, ensure ample resources are provided to your job. What if the documentation does not provide such information or is unclear? There are a couple of methods at your disposal for revealing your job’s requirements:

1. Brute-force methods involve benchmarking the code under various conditions and analyzing the results to determine the best mix of memory, CPU, and interconnect.

2. A more analytic method involves tools such as system monitors, which watch resource utilization at runtime, and code profilers, which identify the performance-intensive areas of the code and how they use the system. Combining the data gathered from these tools with a general understanding of the code yields very accurate predictions of the code's behavior for various input parameters.

Good documentation is always preferred, but it is a luxury that is not always available. A brute-force method is fine if your use of a code will not constitute an appreciable portion of your research time, or if the code is something you will only use for a short period. The analytic method is the most reliable in any case; a sketch of a simple brute-force benchmark is shown below.
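
As a minimal brute-force sketch, assuming GNU time is installed at /usr/bin/time and that ./my_program is a placeholder for an OpenMP-threaded executable of your own, you can run the same case at several thread counts and record the wall time and peak memory of each:

#!/bin/bash
# Hypothetical brute-force benchmark: run the same input at 1, 2, 4 and 8
# threads and capture wall time and peak resident memory for each run.
for threads in 1 2 4 8; do
    export OMP_NUM_THREADS=$threads
    /usr/bin/time -v ./my_program input.dat 2> "bench-${threads}.log"
done
grep -H -e "Elapsed (wall clock)" -e "Maximum resident set size" bench-*.log

The elapsed times and maximum resident set sizes reported by GNU time translate directly into --time and --mem requests.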

See SLURM_Job_Prototyping for more information.

Tools for Profiling Code

The Profiler

A good code profiler is a must-have in any programmer's toolbox, but it's also useful for researchers running HPC codes. Three profiling tools available on the CIRCE/Student clusters are:

  1. pgprof: Portland Group's (PGI) graphical code profiler
  2. gprof: the GNU Profiler, a command-line profiler
  3. vtune: Intel's VTune Performance Analyzer

Profilers insert instrumentation into your application that writes out valuable statistics about the runtime of individual function calls and even individual lines of code. This allows you to track down which functions take up the largest share of runtime during the execution of your program. By adjusting the input parameters that affect these functions, you can get an idea of how your simulations may affect the performance of the application. For the budding computational scientist, a profiler also provides a means to track down inefficient code that can be tuned or refactored for greater efficiency.
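
As a sketch of the typical gprof workflow (assuming a C code built with GCC; the source, binary, and input names are placeholders):

# Compile with profiling instrumentation (-pg), run the program once to
# produce gmon.out, then ask gprof for a flat profile and call graph.
gcc -O2 -pg -o my_app my_app.c
./my_app input.dat
gprof ./my_app gmon.out > profile.txt

profile.txt then lists each function's share of the total runtime and its call counts, which points directly at the hot spots worth tuning.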

Monitor Programs

Programs like ‘top’ [man (1) top] allow you to take periodic snapshots of a process during execution (usually every second) to continuously monitor CPU utilization, memory usage, and the process state (running, sleeping, waiting for disk, writing to disk, etc.). To run top and view the details of your application, you'll need to know which node your process is running on. For a serial process, this is pretty straightforward:

[user@login0 ~]$ squeue -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R %S" -u user
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON) START_TIME
             40697     circe test-job     user  RUNNING       0:02     30:00      1 svc-3024-6-25 2015-02-19T13:10:50

In this case, the job is running on svc-3024-6-25. You can ssh into the host and run ‘top’ like so:

[user@login0 ~]$ ssh svc-3024-6-25
[user@svc-3024-6-25]$ top
top - 13:18:18 up 76 days, 23:41,  1 user,  load average: 5.45, 5.17, 7.58
Tasks: 383 total,   6 running, 377 sleeping,   0 stopped,   0 zombie
Cpu(s): 38.1%us,  0.2%sy,  0.0%ni, 57.4%id,  0.0%wa,  0.0%hi,  4.3%si,  0.0%st
Mem:  24605092k total,  4292884k used, 20312208k free,   199624k buffers
Swap: 26836984k total,   294368k used, 26542616k free,   777608k cached

  PID USER      PR  NI  VIRT   RES  SHR S %CPU %MEM    TIME+  COMMAND
 2859 user      15   0 19408 43632  844 R   99  1.1   0:03.91 myprocess
...

Here, you can see that the process is decidedly CPU-bound. It is in a running state, consuming nearly 100% of a CPU core while using only about 1% of the system's memory. The resource requirements for this sort of process are relatively straightforward.

How about when things get a little bit more intense?

[user@login0 ~]$ ssh svc-3024-6-25
[user@svc-3024-6-25]$ top
top - 13:18:18 up 76 days, 23:41,  1 user,  load average: 5.45, 5.17, 7.58
Tasks: 383 total,   6 running, 377 sleeping,   0 stopped,   0 zombie
Cpu(s): 38.1%us,  0.2%sy,  0.0%ni, 57.4%id,  0.0%wa,  0.0%hi,  4.3%si,  0.0%st
Mem:  24605092k total,  4292884k used, 20312208k free,   199624k buffers
Swap: 26836984k total,   294368k used, 26542616k free,   777608k cached

  PID USER      PR  NI  VIRT   RES  SHR S %CPU %MEM    TIME+  COMMAND
 2859 user      15   0 9112m 8912m  844 D   10 96.1   0:05.21 myprocess2
...

Here, we have a problem. You can see that the state (the column ‘S’) is ‘D’, which indicates that the process is waiting on a read from or a write to disk. Looking at the %MEM column, we see that 96.1% of the system's main memory has been consumed by the application, and on top of that, a few hundred megabytes of swap space are in use. Essentially, this program is using so much memory that:

  1. It is having to use disk swap to store temporary data needed for execution
  2. It is causing the rest of the system (kernel, services, etc.) to swap in order to make room for the application and its dataset

In either case, this is bad news for our application and its performance. We need a system with more memory.

Since the compute node in this example has 8GB of RAM and the process is apparently using 9GB or more, we have a good basis for making a resource request:

#SBATCH --mem=10240

The --mem option is specified in megabytes, so 10240 corresponds to 10GB. This tells the scheduler to run your job only on a node with at least 10GB of RAM available for your process. Your job will not run until hardware resources are available to satisfy the request, but once they are found, say on a node with 16GB of RAM, the job will run dramatically faster than it would on a swapped-out machine like the one in the example.
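
After the job completes, SLURM's accounting tools can show how much memory and time it actually used, which makes it easy to refine the request for future submissions. For example, using the job ID from the earlier squeue output:

[user@login0 ~]$ sacct -j 40697 --format=JobID,JobName,Elapsed,MaxRSS,State

MaxRSS reports the peak resident memory of each job step and Elapsed reports the wall time actually consumed; compare these against the requested --mem and --time and adjust accordingly.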

Note: More information to come.