SLURM Partitions
Revision as of 15:54, 3 January 2017
Queue Layout
It is no longer necessary to discuss queues in the traditional sense. In the past, we created queues based on pools of hardware resources: a user who wanted a particular hardware resource would request the appropriate queue. In practice, however, what a user requests and what is best for that user, or for all users, are not necessarily the same. Allowing individuals to dictate where their jobs run inevitably leads to throughput problems, since it is unreasonable to expect users to understand the complete state and behavior of the scheduler.
Below is a general description of how jobs make their way through the queue. Please see Scheduling and Dispatch Policy for more information.
When the scheduler finds a queue in which a job is eligible to run (based on its requested runtime), it then determines whether the job's requested hardware requirements (see Using Features) match the resources the queue provides. If they do, the job is executed as soon as resources are available. If no resources are available, the job is held until the next scheduler iteration, which re-checks whether resources have become available.
“Available resources” include processors and memory. Processors generally correspond to the number of slots in a given queue, while memory is tracked as a scheduler resource that may not be obvious to query. If your job is waiting in the PD (pending) state, it is likely that either the slots or the memory requested are beyond what the system can provide at that particular point in time.
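To see why a job is still pending, you can query the scheduler directly; a minimal sketch (the job ID 12345 is a placeholder for your own job's ID):

```shell
# List your pending jobs; the last column shows the scheduler's
# reason for holding each job (e.g. Resources, Priority).
squeue -u "$USER" --states=PD --format="%.10i %.9P %.20j %.8T %.20R"

# Inspect one job in detail; the Reason field appears in the output.
# 12345 is a placeholder job ID.
scontrol show job 12345
```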
The following node sets are available:
{| class="wikitable"
! Memory !! CPU !! Cores !! Interconnect !! Nodes !! Slots !! GPUs !! Complex Flags !! Location
|-
| 24GB || Opteron 2384 || 12 || 4x DDR IB || 32 || 384 || n/a || ib_ddr, ib_psm, tpa, sse4, sse4a, cpu_amd, opteron_2384 || Tampa
|-
| 24GB || Xeon E5649 || 12 || 4x QDR IB || 107 || 1284 || 8 || ib_qdr, ib_psm, sse4, sse41, sse42, wh, cpu_xeon, xeon_E5649 || Tampa
|-
| 24GB || Xeon E5-2630 || 12 || 4x QDR IB || 67 || 804 || n/a || ib_qdr, ib_psm, sse41, sse42, avx, wh, cpu_xeon, xeon_E52630 || Tampa
|-
| 24GB || Xeon E5649 || 12 || 4x QDR IB || 14 || 168 || n/a || ib_qdr, ib_ofa, sse41, sse42, avx, wh, cpu_xeon, xeon_E5649 || Tampa
|-
| 32GB || Xeon E5-2670 || 16 || 4x QDR IB || 129 || 2064 || 40 || ib_qdr, ib_psm, sse4, sse41, sse42, avx, wh, cpu_xeon, xeon_E52670, gpu_K20 || Tampa
|-
| 512GB || Xeon E5-2650 || 20 || 4x QDR IB || 3 || 60 || n/a || ib_qdr, ib_psm, tpa, sse4_1, sse4_2, avx, avx2, gpfs, cpu_xeon, xeon_E52650, mem_512G || Tampa
|-
! Total !! !! !! !! !! 4764 !! 48 !! !!
|}
Nodes located in Tampa have slower access to the /work storage. You can make your request for Winter Haven nodes mandatory ("hard") by specifying:
#SBATCH --constraint=wh
in your job submit script.
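Put together, a minimal submit script that requires Winter Haven nodes might look like the following sketch; the job name, resource counts, runtime, and program name are placeholders, not values the documentation prescribes:

```shell
#!/bin/bash
#SBATCH --job-name=wh-example     # placeholder job name
#SBATCH --ntasks=12               # placeholder core count
#SBATCH --time=01:00:00           # placeholder runtime request
#SBATCH --constraint=wh           # only run on Winter Haven nodes

# ./my_program is a placeholder for your actual executable.
srun ./my_program
```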
The node sets are associated with the following queues:
{| class="wikitable"
! Queue Name !! Max Runtime !! Notes
|-
| circe || infinite || default general-purpose queue
|-
| development || 2 hours || short-run CIRCE nodes; 4 nodes maximum per job, maximum 3 jobs per user, 32 cores total across all jobs; access with the "development" QOS
|-
| cuda || infinite || CUDA GPU nodes
|-
| devgpu || 3 hours || short-run CUDA GPU nodes
|-
| himem || 1 week || large-memory job queue (>= 24 GB)
|}
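Submitting to one of these queues combines the partition name with, where required, the QOS; a sketch (the script name is a placeholder):

```shell
# Submit to the short-run development partition using the
# "development" QOS noted in the table above.
sbatch -p development --qos=development ./submit-script.sh
```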
About Preemption:
Some hardware on CIRCE is provided by research contributors. This hardware is available to all CIRCE users by specifying the partition (for example: sbatch -p hii01 ./submit-script.sh). The caveat, however, is that because this is contributor hardware, non-contributor jobs running on such a partition are subject to preemption.
There is a specific grace period (defined per partition) before a contributor's job(s) will cancel a non-contributor's job(s). Any user taking advantage of this hardware should therefore enable some kind of checkpointing, so that interrupted jobs can be re-submitted without needing to start over.
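One common pattern for surviving preemption, sketched here under the assumption that your application can write its own checkpoint files, is to request requeueing and trap the warning signal SLURM sends before cancelling the job. The signal choice, timing, partition name, and helper names below are illustrative, not site-mandated values:

```shell
#!/bin/bash
#SBATCH --partition=hii01         # example contributor partition; preemptible
#SBATCH --requeue                 # allow SLURM to requeue the job on preemption
#SBATCH --signal=B:USR1@120       # send SIGUSR1 to the batch shell 120s before kill

# save_checkpoint is a placeholder for however your application
# persists its state; trap the early-warning signal to run it.
trap 'save_checkpoint; exit 0' USR1

# ./my_program is a placeholder executable; run it in the background
# so the shell can receive the trapped signal while waiting.
./my_program &
wait
```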