Cluster allocation and priorities

Revision as of 14:23, 29 April 2020 by Desantis

Research Computing allocation and priority procedure


Condominium clusters have become the standard for central campus high-performance computing resources. In this model, the cost of the cluster is shared between the university and researchers. The researchers' primary responsibility is to fund computational resources, which are then placed into the central cluster. Scheduling software then allows these resources to be shared with other university members when the contributing researchers are not using them. To provide a viable computing environment for a large research university, it is important to make usage, especially for students, as friction-free as possible while providing contributing researchers a level of service that makes inclusion in a central cluster a value-added proposition. The purpose of this document is to balance these goals by providing guiding principles for the administration of the USF CIRCE queuing system configuration.

Definitions


  • Active Contributor

A member of the USF community who has funded hardware in the central cluster equal to or greater than five node equivalents (defined below). Active Contributor status begins when the hardware is put into production and lasts four years. At the end of the four years, the contributed hardware will transition into the general-purpose resource pool.

  • Contributor

Any member of the USF community who has funded hardware in the central cluster.

  • Contributor Group

The users who work with a contributor. It is unusual for a contributor to be working alone; the prototypical group is a faculty member and his or her postdocs and students. Other group configurations are possible.

  • Standard User

Any member of the USF community needing to use the central cluster who is not a contributor.

  • Node Equivalent

Although form factors may change, a node equivalent for the purposes of this document refers to a system, or portion of a system, that is equal in processing power to two current multi-core CPUs with InfiniBand connectivity and a minimum of 2 GB of RAM per processor. Contribution comes with an increased disk allotment. Alternate hardware configurations, such as specialty hardware, will be considered on a case-by-case basis.

  • Preemption

The process of canceling currently running jobs to make room for jobs submitted by users with elevated access.

  • Priority

Scheduling preference assigned to a computational job.
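
The user classes defined above can be sketched as a small classification routine. This is only an illustration under stated assumptions, not the production configuration: the `Contribution` record, the `classify` function, and the `add_years` helper are hypothetical names, and the sketch assumes that a user whose four-year term has ended is still treated as a Contributor.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

ACTIVE_MIN_NODE_EQUIVALENTS = 5  # threshold from the Active Contributor definition
ACTIVE_TERM_YEARS = 4            # Active Contributor status lasts four years


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years; Feb 29 falls back to Feb 28 if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


@dataclass
class Contribution:
    """Hypothetical record of hardware funded by one user."""
    node_equivalents: float  # size of the funded hardware
    production_date: date    # day the hardware was put into production


def classify(contribution: Optional[Contribution], today: date) -> str:
    """Map a contribution record (or None) to one of the three user classes."""
    if contribution is None or contribution.node_equivalents <= 0:
        return "Standard User"
    term_end = add_years(contribution.production_date, ACTIVE_TERM_YEARS)
    is_active = (
        contribution.node_equivalents >= ACTIVE_MIN_NODE_EQUIVALENTS
        and contribution.production_date <= today < term_end
    )
    # Assumption: after the term ends, the user remains a Contributor.
    return "Active Contributor" if is_active else "Contributor"
```

For example, `classify(Contribution(5, date(2018, 1, 1)), date(2020, 4, 29))` yields "Active Contributor", while the same contribution with three node equivalents yields "Contributor".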

Models of Contribution


  • Resource Sharing

In this model, the Active Contributor receives higher priority and preemption rights. The priority boost applies to the whole cluster, while preemption rights are restricted to the contributed resources. The contributor can therefore use all cluster resources at the higher priority, effectively multiplying their contribution. After a period of four years, Research Computing reserves the right to place the contributed hardware into the general, open-access pool without restriction.

  • Non-Sharing

Some computational tasks, especially those handled under contract, must begin within specific time frames. In this model, the contributor can have sole access to the contributed hardware but does not get access to the rest of the cluster. Neither the contributor nor their research group gains any priority or preemption rights on the rest of the cluster. Again, after a period of four years, Research Computing reserves the right to place the contributed hardware into the general, open-access pool without restriction.
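
The difference between the two models can be summarized as a lookup table. The sketch below is one possible encoding, with hypothetical names (`Model`, `Rights`, `RIGHTS`); it assumes that preemption rights are unnecessary under Non-Sharing because the contributor already has sole access to the hardware.

```python
from dataclasses import dataclass
from enum import Enum


class Model(Enum):
    RESOURCE_SHARING = "resource sharing"
    NON_SHARING = "non-sharing"


@dataclass(frozen=True)
class Rights:
    cluster_wide_priority_boost: bool   # higher priority on the whole cluster
    preempt_on_contributed: bool        # may preempt jobs on contributed hardware
    exclusive_contributed_access: bool  # sole access to contributed hardware
    general_cluster_access: bool        # may run jobs on the rest of the cluster


# Rights granted to the contributor and their group under each model.
RIGHTS = {
    Model.RESOURCE_SHARING: Rights(
        cluster_wide_priority_boost=True,
        preempt_on_contributed=True,
        exclusive_contributed_access=False,
        general_cluster_access=True,
    ),
    Model.NON_SHARING: Rights(
        cluster_wide_priority_boost=False,
        preempt_on_contributed=False,  # assumption: not needed with sole access
        exclusive_contributed_access=True,
        general_cluster_access=False,
    ),
}
```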

Priority Assignment Algorithm


There are three levels of base priority; the higher the priority, the sooner the scheduling system dispatches a job. In decreasing order, the levels are Active Contributor, Contributor, and Standard. The scheduler will modify priority assignments to ensure an equal distribution of resources within each priority level.

The scheduler will use fair-share scheduling, which reduces a user's priority based on the amount of resources used in a recent epoch (two weeks). This will keep job dispatch even between users in the same priority level. In addition, to provide relief from job starvation, the cluster alters the priority of jobs based on time waiting in the queue and the size of the job, including the runtime request. Except in extreme circumstances, these adjustments will not alter priorities enough to move a job from one priority level to another. This means that automated priority changes affect job dispatching only in relation to other jobs in the same priority level. All members of a Contributor Group have the same priority.
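
One way to picture the scheme is a base value per priority level plus bounded adjustments. The sketch below is not the scheduler's actual formula: the level spacing, the weights, and the function names are all assumptions chosen so that the adjustments can never exceed the gap between levels. The direction of the job-size adjustment is not specified in this document, so `size_factor` is left as a caller-supplied value between 0 and 1, and the "extreme circumstances" exception is not modeled.

```python
LEVEL_GAP = 1000  # hypothetical spacing between base priority levels

BASE_PRIORITY = {
    "Active Contributor": 3 * LEVEL_GAP,
    "Contributor": 2 * LEVEL_GAP,
    "Standard": 1 * LEVEL_GAP,
}

# Hypothetical weights; their sum (999) stays below LEVEL_GAP, so the
# adjustments cannot lift a job into the next priority level.
WEIGHT_FAIRSHARE = 500
WEIGHT_AGE = 300
WEIGHT_SIZE = 199


def clamp01(x: float) -> float:
    """Restrict a factor to the range [0, 1]."""
    return max(0.0, min(1.0, x))


def fairshare_factor(user_usage: float, level_usage: float) -> float:
    """Return 1.0 for no usage in the two-week window, falling toward 0.0
    as the user's share of the level's recent usage grows."""
    if level_usage <= 0:
        return 1.0
    return 1.0 - clamp01(user_usage / level_usage)


def job_priority(level: str, user_usage: float, level_usage: float,
                 age_factor: float, size_factor: float) -> float:
    """Base level plus fair-share, queue-age, and job-size adjustments."""
    adjustment = (
        WEIGHT_FAIRSHARE * fairshare_factor(user_usage, level_usage)
        + WEIGHT_AGE * clamp01(age_factor)
        + WEIGHT_SIZE * clamp01(size_factor)
    )
    return BASE_PRIORITY[level] + adjustment
```

With these numbers, a Standard job with every adjustment at its maximum still ranks below a Contributor job with every adjustment at zero, while two users in the same level are ordered by their recent usage.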

Preemption


Active Contributors can preempt jobs running on resources equivalent to the hardware that they have contributed. If an Active Contributor's job does not start within two hours of submission, the system will free resources by canceling any jobs running on the contributed hardware that are not associated with a member of the contributor's group. In most circumstances this will allow the Active Contributor's job to start. However, the preemption process does not affect jobs submitted by members of the Contributor Group, so it may not be able to free sufficient resources if those resources are in use by group members.
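
The preemption rule can be sketched as a selection over running jobs. This is a simplified illustration, not the scheduler's implementation: `RunningJob` and `jobs_to_cancel` are hypothetical names, contributed hardware is represented as a set of node names, and the sketch returns every eligible job, whereas a real system would likely cancel only as many as the waiting job needs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Set

PREEMPT_WAIT = timedelta(hours=2)  # wait threshold stated in the policy


@dataclass
class RunningJob:
    """Hypothetical view of a job currently running on the cluster."""
    job_id: str
    owner: str
    node: str


def jobs_to_cancel(submitted_at: datetime, now: datetime,
                   group_members: Set[str], contributed_nodes: Set[str],
                   running: List[RunningJob]) -> List[RunningJob]:
    """Return the running jobs that may be canceled for a waiting
    Active Contributor job, or an empty list if it has waited under two hours."""
    if now - submitted_at < PREEMPT_WAIT:
        return []
    return [
        job for job in running
        if job.node in contributed_nodes       # only on contributed hardware
        and job.owner not in group_members     # never the group's own jobs
    ]
```

If every job on the contributed nodes belongs to a group member, the result is empty, which mirrors the case where preemption cannot free sufficient resources.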