FAQ

CIRCE Frequently Asked Questions

This page contains a list of questions that we receive quite often from our users. We’ll attempt to answer these questions as best as possible so that you can come here for quick answers. If there are other questions you’d like to see added to this FAQ, please send them to rc-help@usf.edu.

Q. Help! I can’t log in!

A. Here are some possibilities:

  1. Have you changed your NetID password lately (E-Mail, Blackboard, etc.)? If you have, make sure you are using that same new password when you log in to CIRCE.
  2. Have you forgotten your password? You can reset it at https://netid.usf.edu
  3. Are you no longer a student? Send an E-Mail to rc-help@usf.edu saying that you’re a former student and that you’d like access to your Research Computing data. Please provide the USF NetID that you used to access your account.

Q. What is this “queue” thing I keep hearing about and why do I have to use it?

A. The queue is like any other queue: it’s simply a waiting line for people who want access to limited resources. Whether it’s a bank teller, the nice person at the DMV, or a 32-processor server with 512GB of RAM, resources must be managed. You’ll want to read our Guide to SLURM for a better description.

Q. How do I run graphical applications like Matlab, Comsol, p4vasp, PyMol, or CUBIT?

A. Please visit the CIRCE/SC Desktop Environment page for documentation on how to connect to CIRCE/SC and run graphical applications.

Q. I’ve submitted a job to the cluster and I’m getting errors about “Command not found” or “Syntax Error”. The script looks fine when I edit it. What is the problem?

A. A number of different problems can cause these messages. The most frequent are listed below:

  1. Did you create your submit script from within the Windows editor Notepad? If so, please ensure that you convert your submit script with the dos2unix command by running:
    [user@host ~]$ dos2unix <script_here>
    where <script_here> is the name of your job script. This will convert the file to the proper UNIX file format that SLURM recognizes.
  2. Does your submit script contain the appropriate module lines? Please see the documentation for your respective application for the appropriate module lines to add.
  3. Are you actually referencing a file that exists? You can make sure that the file exists by doing:
    [user@host ~]$ ls <path_to_file>
    where <path_to_file> is the path to the file. If ls cannot find the file, then chances are it doesn’t exist on the nodes either.
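
The first and third checks above can be combined into a small pre-submission test. This is only a sketch: has_crlf and check_script are hypothetical helper names, not part of CIRCE or SLURM, and the script only reports problems (it does not convert anything, so dos2unix is still the fix).

```shell
#!/bin/sh
# Hypothetical helpers (not part of CIRCE or SLURM) for checking a job script.

# Succeeds if the file contains carriage returns (DOS/Windows line endings).
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Prints a one-line diagnosis for a job script or input file.
check_script() {
    if [ ! -f "$1" ]; then
        echo "missing: $1"
    elif has_crlf "$1"; then
        echo "crlf: $1 (run dos2unix on it)"
    else
        echo "ok: $1"
    fi
}
```

For example, after sourcing these functions, check_script <script_here> prints "crlf: ..." for a script saved from Notepad and "ok: ..." once it has been converted.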

Q. I just deleted an extremely important file from my home directory and I’m defending my thesis next week! Please tell me you have a backup copy!

A. Yes, we have at least 14 days of incremental backups to rely on. Please provide the path and file name and send a request to rc-help@usf.edu in order to request a restore.

Q. I try to run Ansys/Matlab/Mathematica/Gambit on CIRCE/SC and it fails, saying “This model requires more scratch space than available” or “Out of memory”. Why is this happening?

A. There are a couple of reasons why this happens:

  1. This happens if you are trying to run the command directly from a login machine (your prompt will look like this):
    [user@login0 ~]$
    To be fair to other users, we have resource restrictions on our login machines limiting the amount of RAM and CPU time you can use. We recommend running these applications from an srun session:
    [user@host ~]$ srun --time=HH:MM:SS --pty /bin/bash
    where --time= is the hours, minutes, and seconds that you believe you will need to finish working with an application. With srun, you will be guaranteed the resources needed to complete your work.
  2. This can also happen if you are within an srun session and you did not specify a memory request. Try again with a reasonable estimate of how much memory (in megabytes) you think you need (for example, 4096 Megabytes, or 4 Gigabytes):
    [user@host ~]$ srun --time=HH:MM:SS --mem=4096 --pty /bin/bash
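
Because the memory request is given in megabytes, it can help to do the gigabyte conversion explicitly. The sketch below uses two hypothetical helpers, gb_to_mb and srun_cmd (neither is a SLURM command); srun_cmd only prints the command line so that it can be reviewed before being run on a login machine.

```shell
#!/bin/sh
# Hypothetical helper: convert whole gigabytes to the megabyte value used by --mem.
gb_to_mb() {
    echo $(( $1 * 1024 ))
}

# Print (do not run) an interactive srun command line.
# $1 = time limit as HH:MM:SS, $2 = memory in whole gigabytes.
srun_cmd() {
    echo "srun --time=$1 --mem=$(gb_to_mb "$2") --pty /bin/bash"
}

# Example: two hours and 4 GB of memory.
srun_cmd 02:00:00 4
# prints: srun --time=02:00:00 --mem=4096 --pty /bin/bash
```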

Q. I submitted my job and it has been in the ‘PD’ (pending) state for a long time. Why isn’t my job being run?

A. There are a few reasons why this might happen:

  • Did you specify a queue? Don’t do this! If all of the slots in the queue are occupied, you have to wait until enough are freed to start your job. Removing the queue definition in your submission script will allow your job to use many more resources and will increase your overall job throughput.
  • Did you make a reasonable resource request? If your submission script calls for 24 processors, 96 GB of RAM, and 2 nodes, or 4096 processors and 10TB of RAM, or 1 week of requested runtime, you’ll incur scheduling delays due to the specificity and/or size of the request, as the scheduler will need to allocate the resources while contending with other jobs and their requests. To see our available hardware pool in order to make good decisions about resource requests, see the following guides:
Imagine randomly walking throughout a theme park, without any agenda. You will see various lines at all rides and attractions. Most likely, you'll opt to stand in the shortest lines, even if it means initially passing by your most desired ride or attraction, to take advantage of the shortest waiting period. Then, after some time has passed, you'll be able to return and experience that ride or attraction when the waiting period is acceptable. Now, imagine sticking to an agenda of rides and attractions to see, in a specific order. Instead of skipping long lines, you will remain in them and experience longer waiting periods. Depending on how busy the park is, you may or may not get to experience every ride and attraction on your list for the day.
In some cases, it does make sense to be very specific with your submission requests, e.g. for a specific GPU or CPU, where the "wait" is worth additional job pending time. But, if your application doesn't require specificity, it is best to allow the scheduler to pick readily available resources for you.
  • Contribution status
A major factor in determining your job's priority is your research group's and/or faculty sponsor's contribution status. Your job's base priority is configured via a default QOS:
Your default QOS combined with your resource request could result in longer than expected job pending times.
Contributor status vs. non-contributor status
Imagine going to a theme park on a very busy day with the least expensive pass possible. You will experience long waiting periods at all rides and attractions. But, if the park isn't as crowded, you will experience shorter waiting periods. This is, for all intents and purposes, what your jobs will experience on the cluster without preferential scheduling.
Now, imagine if you paid for a more expensive pass. You will receive preferential treatment at each ride and attraction, with the obvious benefit of extremely short waiting periods. But, there is a chance that even on crowded days with a more expensive pass you will experience longer than expected waiting periods.

Q. My job keeps terminating with no indication of anything wrong. What gives?

A. By default, all SLURM jobs have a runtime of 1 hour, after which the scheduler will send a termination signal. Have you specified a runtime on your job?

--time=08:00:00

This option would request a runtime of eight hours. The --time or -t option is required on both submitted jobs and interactive jobs. It is also possible that you did not specify enough time when making your request. Please see this guide on determining job run-time for help: Job Runtime Guide
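
If you know roughly how long a previous run took, the --time value can be derived from it rather than guessed. The sketch below is an illustration only: to_slurm_time and pad_seconds are hypothetical helper names, and the 20 percent safety margin is an assumption, not a CIRCE recommendation.

```shell
#!/bin/sh
# Hypothetical helper: format a number of seconds as HH:MM:SS for --time.
to_slurm_time() {
    total=$1
    printf '%02d:%02d:%02d\n' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

# Hypothetical helper: pad a measured runtime (seconds) by a margin in percent.
pad_seconds() {
    echo $(( $1 * (100 + $2) / 100 ))
}

# Example: a run that took 24000 seconds, padded by an assumed 20% margin.
to_slurm_time "$(pad_seconds 24000 20)"
# prints: 08:00:00
```

The printed value can then be used as --time=08:00:00 in a submit script or an interactive request.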

Q. I’ve submitted my job to the partitions utilizing Omni-Path and I received strange PSM/boot qp to RTR errors, or PSM found 0 available contexts on InfiniPath device(s). (err=21)! The job runs without an issue on the QDR Infiniband partitions, though! What's wrong?

A. Some applications compiled with older versions of the Intel compiler suite (<= 2017_cluster_xe) and relying on Intel's MPI implementation cannot use the Omni-Path interconnect. As a result of this incompatibility, these applications simply fail. However, a PSM2 compatibility library is available that allows them to function as expected. To use it, add it to your environment after loading your application module file(s) but before the binary is invoked:

#!/bin/bash
#SBATCH --job-name=Test

module purge
module load apps/app/version
# Add the PSM2 compatibility library after loading modules, before the binary runs
export LD_LIBRARY_PATH=/usr/lib64/psm2-compat:$LD_LIBRARY_PATH

mpirun binary
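
A slightly more defensive variant of the export line is sketched below. The function name prepend_ld_path is hypothetical; it prepends the directory only if it exists on the node, and it avoids leaving a trailing colon in LD_LIBRARY_PATH when the variable was previously empty (an empty entry makes the dynamic loader search the current directory).

```shell
#!/bin/sh
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH only if it exists.
prepend_ld_path() {
    if [ -d "$1" ]; then
        # ${VAR:+...} adds the ":old value" suffix only when VAR is non-empty.
        LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
        export LD_LIBRARY_PATH
    else
        echo "warning: $1 not found, LD_LIBRARY_PATH unchanged" >&2
        return 1
    fi
}
```

In a submit script this would be called as prepend_ld_path /usr/lib64/psm2-compat after the module load lines and before the binary is invoked.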

Q. Why do I get redirected while using Internet Explorer to download software from the RC isos site?

A. The version of Internet Explorer you are using has a bug and is unable to download large files. Because of this, we have redirected users of Internet Explorer to this FAQ to explain the issue. You will be able to access the ISOS site and download the files, but you must use another browser, such as Firefox, Mozilla, Chrome, Opera, or Safari. Here is a link to a Microsoft KB article discussing the issue in more detail: http://support.microsoft.com/kb/298618.

Q. I’ve read all of this stuff and I’m still having trouble. Who do I contact for help?

A. Send an e-mail to rc-help@usf.edu with a detailed description of your problem. Include any input or output files, as well as any error messages or warnings. User education and training sessions are also provided upon request.