=== Recent News ===
{|class=wikitable
|- style="background-color:#f1edbe;text-align:center;font-size:90%"
|'''Posted on:'''
|'''Details'''
|- style="font-size:90%"
|'''06/02/2022 15:53 EDT'''
|'''SLURM interactive job behavior change, update'''
To recap, the following examples were given in the previous announcement:
<blockquote>For example, if you need to request a 16 CPU interactive session you'd use the following syntax:
"srun -c 16 --pty /bin/bash"<br>
"srun --cpus-per-task=16 --pty /bin/bash"<br>
"srun --nodes=1 --cpus-per-task=16 --pty /bin/bash"<br>
The first and second examples will allocate 16 CPUs, potentially over multiple nodes.  The third example will only allocate 16 CPUs on a single node.<br>
</blockquote>
We would like to clear up potential confusion with the examples above.  When using -c or --cpus-per-task, the total number of cores on a single node is taken into consideration.  Therefore, if you request more cores than are available per node on any given partition [0][1], you will receive an error.  Instead, start with the total number of cores desired for the interactive job and then evenly divide those over a node count (-N or --nodes=).
For example, if an interactive session requires 48 cores on the default "circe" partition, it will need to be divided over at least 2 nodes, since the "circe" partition only has 24 cores per node.  At a minimum, the request would need to be one of the following:
srun -N 2 -c 24 -t 01:00:00 --pty /bin/bash<br>
srun --nodes=2 --cpus-per-task=24 --time=01:00:00 --pty /bin/bash<br>
Please note that requesting all 24 cores per node will result in longer job pending times, because entire nodes are being reserved.  It is best to break this specific request over multiple nodes for faster dispatch times.  For example, the same 48-core request could be divided over 6 nodes:
srun -N 6 -c 8 -t 01:00:00 --pty /bin/bash<br>
srun --nodes=6 --cpus-per-task=8 -t 01:00:00 --pty /bin/bash<br>
Again, these changes have only affected parallel, interactive jobs.  Batch jobs are not affected.
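As a quick way to check how many cores each node in a partition provides before sizing an interactive request, something like the following should work (a sketch using standard sinfo output fields; the partition name is only an example):
 # Show partition name, node count, and CPUs per node for "circe"
 sinfo -p circe -o "%P %D %c"
 # List CPUs per node across all partitions
 sinfo -o "%P %c" | sort -u
The "%c" field reports CPUs per node, which is the per-node limit that -c/--cpus-per-task is checked against.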
|-
|- style="font-size:90%"
|'''05/31/2022 10:27 EDT'''
|'''SLURM interactive job behavior change'''
Research Computing administrators made a change to SLURM scheduling on May 25, 2022 at approximately 13:27 EDT; cgroups support was implemented.
Previously, without cgroups, users would have been able to (inadvertently) utilize resources which weren't requested, such as GPUs.  In a nutshell, this change explicitly denies access to resources that weren't requested.
Unfortunately, this change has introduced some "breaking" behavior for interactive-only jobs (srun/salloc) in terms of how CPUs are requested; batch submissions '''are not affected'''.  For any users whose interactive workflows require access to more than a single CPU, the following flags can no longer be used to request multiple CPUs:
'''"--ntasks="<br>
'''"-n"'''
Instead, please use the following flags to request CPU's:
'''"--cpus-per-task="<br>'''
'''"-c"'''
For example, if you need to request a 16 CPU interactive session you'd use the following syntax:
'''"srun -c 16 --pty /bin/bash"<br>'''
'''"srun --cpus-per-task=16 --pty /bin/bash"<br>'''
'''"srun --nodes=1 --cpus-per-task=16 --pty /bin/bash"<br>'''
The first and second examples will allocate 16 CPUs, potentially over multiple nodes.  The third example will only allocate 16 CPUs on a single node.
Any interactive workflows that require a node count are affected, too.  Previously, interactive jobs could request a combination of '''"--nodes=" and "--ntasks-per-node="'''.  With the new behavior, the following combinations are possible:
'''"--nodes=2 -c 12"<br>'''
'''"--nodes=4 -c 3"<br>'''
'''"--nodes=4 -c 2"<br>'''
'''"--nodes=1 -c 24"'''
The first example will request a total of 24 CPUs spread evenly over 2 nodes, the second example will request 12 CPUs spread over 4 nodes, the third example will request 8 CPUs spread over 4 nodes, and the last example will request 24 CPUs on a single node.
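As a sketch of how an existing interactive request might be translated to the new syntax (the 2-node, 12-CPU-per-node shape is only an example):
 # Old style, no longer accepted for interactive jobs:
 srun --nodes=2 --ntasks-per-node=12 --pty /bin/bash
 # New style, the same 24-CPU footprint requested with --cpus-per-task:
 srun --nodes=2 --cpus-per-task=12 --pty /bin/bash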
All interactive documentation on our Wiki site (https://wiki.rc.usf.edu) will be updated to reflect this change.
'''Again, batch submissions are not affected.  The behavior of "--ntasks", "-n", and "--ntasks-per-node" still functions as expected.'''
|-
|}
=== Archived News ===
{|class=wikitable
|- style="background-color:#f1edbe;text-align:center;font-size:90%"
|'''Posted on:'''
|'''Details'''
|- style="font-size:90%"
|'''05/19/2022 15:58 EDT'''
|'''Resolved: SLURM behavior change: exceeded memory limit'''
This issue has been corrected as of May 19, 2022 at approximately 10:58 EDT.
Ultimately, while not confirmed by the SLURM developers to us directly, there was a regression [0] between SLURM versions 20.11.3 and 20.11.9 which was not corrected until version 21.08.0rc2.  Given the concerning message "slurmstepd: error: Exceeded job memory limit", Research Computing administrators decided to perform another update of the SLURM software to the latest version - 21.08.8-2.
Any users whose output files contain the error message above during the time frame in question [1] can simply disregard the messages.
[0] Confirmed via commits to SLURM source code<br>
[1] Monday, May 16, 2022 at 17:00 EDT until May 19, 2022 at 10:37 EDT
|-
|- style="font-size:90%"
|'''05/18/2022 10:52 EDT'''
|'''SLURM behavior change: exceeded memory limit'''
Due to a security issue, Research Computing administrators upgraded the minor version of SLURM: 20.11.3 to 20.11.9 on Monday, May 16, 2022 from approximately 16:13 until 16:25 EDT.
The upgrade was performed online and no jobs were affected; it was a "text book" upgrade.
Unfortunately, a bug has manifested itself within the upgraded version, and it has to do with users' memory requests.  The bug is random (yet frequent) in nature, and jobs' memory requests are affected.  We have observed two consequences of this bug:
1.)  Jobs exceeding their memory limits aren't terminated when they should be, and an incorrect job ID is sent to the controller.  This allows the jobs to continue running, since the controller attempts actions on an incorrect job ID (one that does not exist).  These affected jobs will have the message "slurmstepd: error: Exceeded job memory limit" appear multiple times within output files, when it should only appear once.
2.)  Jobs that do not exceed their memory limit on a PER CPU basis (--mem-per-cpu), but whose total sum of process memory does exceed the PER CPU figure, are improperly identified as having exceeded their memory limit.  These jobs are not terminated, but their output files do contain the same message listed above, "slurmstepd: error: Exceeded job memory limit".
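As a hypothetical illustration of the second case (the resource numbers and application name below are made up): with "--cpus-per-task=4 --mem-per-cpu=2G" the aggregate limit is 8 GB, yet a job using roughly 5 GB in total may still see the spurious message.
 #!/bin/bash
 #SBATCH --cpus-per-task=4
 #SBATCH --mem-per-cpu=2G    # 4 x 2 GB = 8 GB aggregate limit
 # Processes totaling ~5 GB stay under the 8 GB aggregate, but the buggy
 # version may still log "slurmstepd: error: Exceeded job memory limit".
 ./my_app                    # placeholder application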
We have reached out to the developers of the SLURM software for confirmation and corrective action.  Users do not need to change anything in their submission scripts at this time.
We do not expect many jobs to be affected by this issue, as it is random in nature and compute node health is monitored across the cluster.
Once we receive a confirmation and corrective action to take, we will post an update to the list.
|-
|- style="font-size:90%"
|'''05/10/2022 12:18 EDT'''
|'''Ethernet network issues post-mortem'''
Research Computing administrators have carefully reviewed all switch logs collected since the first Ethernet issue within the SVC data center (05/02/2022) and the second Ethernet issue within the MDC data center (05/07/2022).  A post mortem follows.
Initially, an STP hiccup appeared to be the culprit based upon the console messages that were reviewed.  Unfortunately, the console messaging buffers at that time were extremely small, and administrators were unable to view all messages pertaining to the start of the event.  Additional logging was enabled on all switches after the network was restored to a nominal state.
When the second Ethernet issue occurred, collected logs indicated that an optional feature unrelated to STP had reached a pre-determined timeout, resulting in switch ports being disabled as a precaution!
Given the behavior of the optional feature and its pre-determined timeout, Research Computing administrators have disabled it across all switches as of Tuesday, May 10, 2022.  This will prevent further, similar issues from recurring on the Ethernet network.
|-
|- style="font-size:90%"
|'''05/05/2022 21:52 EDT'''
|'''amd_2021 partition'''
The amd_2021 partition has been updated to utilize the same QOS setup as the "circe" partition - none is required.  The appropriate starting priority and resource allocation is automatically granted based upon contribution status.
What does this mean?  It means that the use of the "preempt_short" QOS is no longer required to submit to the partition.  We ask that all amd_2021 partition users please update their submission script(s) and remove the QOS directive.  If the "preempt_short" QOS isn't removed, users will receive an error upon submission and their job(s) will be rejected.
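As a minimal sketch of the change to make in an existing submission script (the time limit and application are placeholders):
 #!/bin/bash
 #SBATCH --partition=amd_2021
 #SBATCH --time=01:00:00
 ## The following directive should now be removed entirely:
 ## #SBATCH --qos=preempt_short
 ./my_app                    # placeholder application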
|-
|- style="font-size:90%"
|'''04/22/2022 21:52 EDT'''
|'''CIRCE filesystem issue'''
At 21:52 EDT on 04/22/2022, RC administrators received alerts of an issue with the GPFS file system. The issue was tracked down to a malfunctioning Infiniband switch that caused fabric errors resulting in "Stale File Handle" errors across the CIRCE login and compute nodes. During the troubleshooting process, ALL running/pending jobs on the CIRCE cluster were cancelled.
As of 23:09 EDT, the fabric is stable and all Infiniband network operations are nominal.
We ask that any users whose jobs were cancelled please resubmit those jobs for processing.
|-
|- style="font-size:90%"
|'''04/04/2022 22:08 EDT'''
|'''Cooling issue within SVC data center'''
At approximately 22:00 EDT on Monday, April 4, 2022 Research Computing administrators were instructed to kill all running jobs on the following partitions, due to an emergency cooling issue within the data center:
* circe
* cool2022
* hii02
* hii-interactive
* hii-test
* himem
* mri2016
In addition, the nodes were powered off to reduce the heat within the data center.  At 00:15 EDT on Tuesday, April 5, 2022, we received the all clear from facilities engineers and have restored cluster operations to the affected partitions.
|-
|- style="font-size:90%"
|'''03/28/2022 09:15 EDT'''
|'''NFS and CIFS server maintenance'''
Research Computing administrators have started the planned maintenance window to update a pair of NFS and CIFS servers. 
These updates are required to address a CIFS connectivity issue with USF's domain controllers, in addition to performance and security updates.  The work is expected to be completed by 17:00 EDT.  Access to \\cifs.rc.usf.edu will be interrupted during this time.
Unfortunately, one of these NFS servers provides access to /home and /shares to the following partitions:
* amd_2021
* bfbsm_2019
* cbcs
* cms_ocg
* hchg
* henderson_itn18
* margres_2020
* muma_2021
* simmons_itn18
* snsm_itn19
Because some user jobs will be negatively affected by not being able to access /home or /shares, a decision has been made to terminate all running jobs on the partitions listed above.  Users will be able to queue jobs during the maintenance window.
|-
|- style="font-size:90%"
|'''03/21/2022 01:02 EDT'''
|'''Resolved: Infiniband fabric issue in SVC'''
At 23:38 EDT on 03/20/2022, RC administrators received alerts of an issue with the GPFS file system. The issue was tracked down to a malfunctioning Infiniband switch that caused fabric errors resulting in "Stale File Handle" errors across the CIRCE login and compute nodes. During the troubleshooting process, all running/pending jobs on the CIRCE cluster were cancelled.
As of 01:02 EDT, the fabric is stable and all Infiniband network operations are nominal.
We ask that any users whose jobs were cancelled please resubmit those jobs for processing.
|-
|- style="font-size:90%"
|'''03/20/2022 23:38 EDT'''
|'''Infiniband fabric issue in SVC'''
RC administrators are actively troubleshooting an issue with multiple Infiniband switches inside of the SVC data center that provide access to the following CIRCE filesystems:
* /home
* /work
* /shares
* /apps
Access to those file systems may be unavailable until service is restored.
RC administrators will provide updates when available.
|-
|- style="font-size:90%"
|'''12/16/2021 16:06 EST'''
|'''Job submissions restored'''
Research Computing administrators have restored job submissions to the following partitions as of Thursday, December 16, 2021 at 16:06 EST.
The applications below are still re-synchronizing, and should be finished within the next few hours:
* cadencetools
* matlab
<span id="BeeGFS_reformat_completed"></span>
|-
|- style="font-size:90%"
|'''12/16/2021 14:27 EST'''
|'''MAJOR BeeGFS maintenance completed'''
As of Thursday, December 16, 2021 at 10:43 EST Research Computing administrators have re-enabled file system access to /work_bgfs and /shares_bgfs across the cluster.  At this time users may start staging their data back to /work_bgfs; in fact, some users already have!  Quotas have been re-enabled.
For users whose data resided in /shares_bgfs - there is an active re-synchronization of cleaned (no chunk files) data in progress.  Please do not worry if files and/or directories seem to be missing; they will be restored within the next 24 hours.  Once all data has been synchronized, quotas will be re-enabled.
As of Thursday, December 16, 2021 at 13:55 EST access [0] has been restored to the student cluster.  All cleaned data which resided within /shares and /home has been restored from backup.  Permitted users will be able to log in without an issue.  Quotas have been re-enabled.
CIFS access to /work_bgfs has been restored on \\cifs.rc.usf.edu.  CIFS access to various BeeGFS shares has been restored on \\cifs-pgs.rc.usf.edu.
The following CIRCE system partitions will remain in an inactive state until close of business today, in order to preserve as much bandwidth as possible for the aforementioned re-synchronization process.
* amd_2021
* bfbsm_2019
* bgfsqdr
* cbcs
* cms_ocg
* devel
* hchg
* henderson_itn18
* margres_2020
* muma_2021
* simmons_itn18
* snsm_itn19
* tfawcett
The following applications are still being synchronized to BeeGFS, and there will be intermittent access issues on the student cluster and the CIRCE partitions listed above until the re-synchronization is finished:
* cadencetools
* matlab
* synopsys
During the maintenance window, all cluster nodes housed in the aforementioned CIRCE partitions had their NFS mount points (/home and /shares) adjusted so that ACLs are respected.  This avoids the issue of needing to explicitly change permissions on content generated by jobs.
The file system was also re-created with performance in mind, and Research Computing administrators hope that all users notice a performance boost once the file system is in a nominal state.
[0] On Monday, December 13, 2021 the authorized groups were rotated since the semester ended.
|-
|- style="font-size:90%"
|'''11/01/2021 16:00 EST'''
|'''MAJOR BeeGFS maintenance planned: December 13, 2021'''
The final process of recovery for the BeeGFS issues that began on October 9, 2021 will require a reformat and redeployment of /work_bgfs.  This maintenance window is scheduled to start December 13, 2021 and last until December 16, 2021.
This process will be no different than [https://wiki.rc.usf.edu/index.php/News#BeeGFS_reformat what we did previously].  The only real changes are the specific SLURM partitions affected, all of which are listed below:
* amd_2021
* bfbsm_2019
* bgfsqdr
* cbcs
* cms_ocg
* devel
* hchg
* henderson_itn18
* margres_2020
* muma_2021
* simmons_itn18
* snsm_itn19
* tfawcett
During this downtime, the SC student cluster will also be unavailable.
Ultimately, Research Computing staff have decided upon this plan of action so that workarounds will not be required to address current corruption concerns.  In addition, during this maintenance window the storage will be redeployed with additional performance in mind.
'''''We ask that all consumers of /work_bgfs begin to archive critical data that cannot be reproduced elsewhere.'''''
Please see the following URL for a quick "how to" regarding data management on CIRCE: https://wiki.rc.usf.edu/index.php/News#BeeGFS_reformat.  We advise users not to "blindly" synchronize data from /work_bgfs to /home, due to quotas.  Instead, tools such as `tar` should be utilized to create single, compressed archives that are of limited size.
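As a rough sketch of the kind of archive we have in mind (the paths are placeholders for your own data):
 # Create a single compressed archive of a results directory on /work_bgfs
 tar -czf $HOME/results_2021.tar.gz /work_bgfs/$USER/results
 # Verify that the archive lists correctly before relying on it
 tar -tzf $HOME/results_2021.tar.gz | head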
We will send out reminders over the next month and a half.  Please do not hesitate to contact us at {{rchelp}} if there are any questions and/or concerns regarding this planned maintenance window.
|- style="font-size:90%"
|'''10/21/2021 22:05 EDT'''
|'''Resolved: Infiniband fabric issue in SVC'''
The Infiniband issue within the SVC data center has been corrected, and all systems are nominal at this time.  For interested users, a post mortem follows.
At 20:45 EDT Research Computing administrators received an alert that an Infiniband switch was no longer present on the fabric.  Several attempts were made to reboot the switch remotely, but physical access was required for a power cycle.  The switch was power cycled and immediately visible on the network at approximately 21:24 EDT, with no more reported network errors.  However, the GPFS file system was still in a stale state.  Given this network issue, GPFS put itself into a fail-safe mode to prevent corruption.  The fail-safe was disabled at 21:31 EDT, and full connectivity to the file system was restored across the cluster.
The CIRCE compute nodes in the MDC data center were affected slightly longer due to NFS issuing "Stale file handle" messages on the /home and /shares mount points.  This issue was corrected at approximately 21:38 EDT.
We advise users who were running jobs during this time frame to check on their output.  If there are any issues, please kill any and all affected jobs and resubmit them.
|-
|- style="font-size:90%"
|'''10/21/2021 20:45 EDT'''
|'''Infiniband fabric issue in SVC'''
RC administrators are actively troubleshooting an issue with the Infiniband fabric inside of the SVC data center that provides access to the following CIRCE filesystems:
* /home
* /work
* /shares
* /apps
Access to those file systems on CIRCE and via cifs.rc.usf.edu may be unavailable until service is restored.
RC administrators will provide updates when available.
|- style="font-size:90%"
|'''10/09/2021 17:55 EDT'''
|'''BeeGFS oddities October 9, 2021'''
Research Computing staff have been made aware of minimal reports of file system oddities on /work_bgfs.  As a precaution, staff members are also scanning /shares_bgfs.
Administrators are currently investigating these reports on a case by case basis.  At this time however, the file system is operational and functioning.  Until more information is obtained, the file system will remain operational for all users.
Updates will be posted as they are received.  Should any users notice oddities with file access on /work_bgfs and/or /shares_bgfs, please do not hesitate to contact administrators via {{rchelp}}.  Please include the full path to affected file(s) and any messages received on the console.
|-
|- style="font-size:90%"
|'''09/09/2021 15:54 EDT'''
|'''New AMD SLURM partition: amd_2021'''
Research Computing is pleased to announce the release of new, general purpose AMD-based computational hardware via the partition "amd_2021".
Each node boasts 128 cores, 1 TB of RAM, and an Infiniband HCA capable of 100 Gbps.  Additional details regarding the hardware and its partition are found via the following URLs:
* https://wiki.rc.usf.edu/index.php/CIRCE_Hardware
* https://wiki.rc.usf.edu/index.php/SLURM_Partitions#Per_Partition_Hardware
In order to access these resources, users must specify the "amd_2021" partition and the "preempt_short" QOS.  There will be no preemption despite the QOS name, and the permitted QOSes and logical configuration of the resources could change within a few weeks.  Research Computing staff will send out a notice to all users if and when these changes are expected.
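As an example of how a request against the new hardware might look (the time limit, core count, and script name are placeholders):
 # Batch job
 sbatch --partition=amd_2021 --qos=preempt_short --time=01:00:00 --cpus-per-task=8 job.sh
 # Interactive session
 srun --partition=amd_2021 --qos=preempt_short --time=01:00:00 --cpus-per-task=8 --pty /bin/bash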
Please note that some software compiled with the Intel compiler suite may display error messages pertaining to no HFI devices being found.  This message is expected and innocuous, as the next available MPI-capable fabric (Infiniband, rather than OmniPath) is found and utilized.  In addition, some Intel-specific compiled software may fail with an "illegal instruction" error.  If this is the case, please make a note, email rc-help@usf.edu with the name of the application and the job number, and then resubmit the affected job(s) to other system partitions.
If there are any questions and/or comments, please email {{rchelp}}
|-
|- style="font-size:90%"
|'''10/02/2021 19:26 EDT'''
|'''File system event on /work_bgfs'''
At approximately 18:47 EDT this evening (Saturday, October 2, 2021), RC systems administrators received a page that one of the services of the BeeGFS file system was not responding. RC administrators were able to restore service at approximately 18:57 EDT.
During this window (18:47 to 18:57 EDT), any access to the /work_bgfs file system would have been interrupted, although other file system services were still operational. Systems administrators viewed the output of several running jobs and there appeared to be no issues. But, it is possible that some jobs may have been affected by the file system event (specifically those utilizing /work_bgfs that were dispatched during that window). Please review all running job output and kill and re-submit any jobs which are no longer producing output.
|-
|- style="font-size:90%"
|'''09/28/2021 15:43 EDT'''
|'''NFS Server Hiccup'''
At approximately 15:22 EDT on September 28, 2021 one of the CIFS servers associated with cifs.rc.usf.edu and the NFS server providing access to /home and /shares within the MDC data center lost access to its GPFS file system.  As a result, users would not have been able to connect via CIFS for a moment and some job output may have produced "Stale File Handle" messages.
The server had to be rebooted and resumed operations at approximately 15:39 EDT.
A synchronization job caused the issue.  Administrators will move the synchronization job to another machine to prevent this from happening again.
|-
|- style="font-size:90%"
|'''09/17/2021 16:16 EDT'''
|'''Resolved: Infiniband fabric issue in SVC'''
Research Computing administrators were alerted to an issue within the SVC Infiniband fabric at approximately 15:03 EDT.  Once administrators were on site, a misbehaving core Infiniband switch was identified as causing routing issues within the fabric.  The routing issues severed access to the file system for some nodes within the cluster.  As a result, some user jobs may have stalled and/or terminated.  Once the switch was power cycled, access to the file system was restored.  At approximately 15:57 EDT all affected systems were reporting nominal operations.
Please resubmit any and all affected jobs, as this is the only remedy for failed jobs during this time frame.
|-
|- style="font-size:90%"
|'''09/17/2021 15:29 EDT'''
|'''Infiniband fabric issue in SVC'''
RC administrators are actively troubleshooting an issue with multiple Infiniband switches inside of the SVC data center that provide access to the following CIRCE filesystems:
* /home
* /work
* /shares
* /apps
Access to those file systems may be unavailable until service is restored.
RC administrators will provide updates when available.
|-
|- style="font-size:90%"
|'''07/26/2021 17:34 EDT'''
|'''NFS Stale file handle issues resolved'''
After today's cooling event within the MDC data center was resolved, some users did report intermittent issues accessing files on /home or /shares from compute nodes.
The problem was tracked down to an issue with NFS services which, unfortunately, required a restart of one of RC's NFS servers.  At this time, all systems are nominal, and no stale file handle errors are being reported.
|-
|- style="font-size:90%"
|'''07/26/2021 10:46 EDT'''
|'''Cooling event, MDC data center 05:45 EDT July 26, 2021'''
As of 10:35 EDT, Monday July 26, 2021 Research Computing administrators have resumed operations within the MDC data center.
We have been informed by building engineers that there was a system malfunction which caused an immediate shutdown of all cooling systems within the data center.  Once a physical inspection was performed within the data center and the "all clear" was given, Research Computing waited until temperatures returned to nominal levels.
At this time all affected systems have resumed operations and are accepting jobs.
If there are any questions and/or comments, please email {{rchelp}}
|-
|- style="font-size:90%"
|'''07/26/2021 06:42 EDT'''
|'''Cooling event, MDC data center 05:45 EDT July 26, 2021'''
Research Computing has been made aware of a cooling event within the MDC data center.
As a precaution, all compute infrastructure within the data center has been powered off.  As more information becomes available, we will post updates.
If there are any questions and/or comments, please email {{rchelp}}
|-
|- style="font-size:90%"
|'''07/23/2021 17:50 EDT'''
|'''DHCPD Event'''
Research Computing administrators were made aware of an issue with several scheduled jobs not starting due to messages stating that nodes were down.
An investigation revealed that ~40 compute nodes did not have their assigned IP addresses, resulting in the scheduler rejecting jobs to said nodes.  The issue was traced back to an error within the DHCP configuration.  The error has been corrected and all nodes have been recovered. 
Any user whose jobs may have failed during this time should resubmit them.  The error messages within users' output files would manifest as a "NODE FAIL" message, or similar.
If there are any questions and/or comments, please email {{rchelp}}
|-
|- style="font-size:90%"
|'''07/16/2021 15:03 EDT'''
|'''Power Outage Update: Normal operations resumed'''
Power has been restored to the areas of the Tampa Campus that were experiencing the outage earlier today. As such, the following partitions have once again resumed normal operations:
* bfbsm_2019
* bgfsqdr
* cbcs
* cms_ocg
* devel
* hchg
* henderson_itn18
* margres_2020
* rra
* rra_con2020
* sc
* simmons_itn18
* snsm_itn19
* tfawcett
If there are any questions and/or comments, please email rc-help@usf.edu
|-
|- style="font-size:90%"
|'''07/16/2021 14:04 EDT'''
|'''Partitions placed into "DOWN" state due to current power outage'''
Currently, portions of the USF Tampa Campus are experiencing a power outage. Due to this outage, the following partitions have been placed into a "DOWN" state:
* bfbsm_2019
* bgfsqdr
* cbcs
* cms_ocg
* devel
* hchg
* henderson_itn18
* margres_2020
* rra
* rra_con2020
* sc
* simmons_itn18
* snsm_itn19
* tfawcett
While the above partitions are in a "DOWN" state, currently running jobs will continue to run and new jobs can still be submitted, but no new jobs will be dispatched to run. All other partitions are currently operating normally.
More information will be provided as it becomes available.
|-
|- style="font-size:90%"
|'''07/12/2021 11:09 EDT'''
|'''File system event on /work_bgfs'''
At approximately 02:20 EDT this morning (Monday, July 12, 2021), RC systems administrators received a page that one of the services of the BeeGFS file system was not responding. RC administrators were able to restore service at approximately 02:52 EDT.
During this window (02:20 to 02:52 EDT), any access to the /work_bgfs file system would have been interrupted, although other file system services were still operational. Systems administrators viewed the output of several running jobs and there appeared to be no issues. But, it is possible that some jobs may have been affected by the file system event (specifically those utilizing /work_bgfs that were dispatched during that window). Please review all running job output and kill and re-submit any jobs which are no longer producing output.
If there are any questions and/or comments, please email rc-help@usf.edu.
|-
|- style="font-size:90%"
|'''06/21/2021 14:27 EDT'''
|'''BeeGFS maintenance window, July 21, 2021 11:00 EDT'''
On July 21, 2021 at approximately 11:00 EDT, Research Computing administrators will need to perform a firmware update on the hardware which provides access to the BeeGFS file system. The following partitions will be affected:
* bfbsm_2019
* bgfsqdr
* cbcs
* cms_ocg
* devel
* hchg
* henderson_itn18
* margres_2020
* simmons_itn18
* snsm_itn19
* tfawcett
The following file system paths will be affected on the aforementioned systems:
* /work_bgfs
* /shares_bgfs
* /apps
During the maintenance window, the following paths will be affected on the CIRCE login nodes and other RC-based login nodes:
* /work_bgfs
* /shares_bgfs
Given the architecture of the file system (redundancy and striping), the maintenance will be performed while the system is operational, in a rolling fashion. This will be noticeable to users with interactive workflows via temporary "remote I/O error" status messages lasting up to 5 minutes while accessing affected portions of the file system. Batch jobs shouldn't be affected by these messages, given memory caching and asynchronous disk activity.
Several notices will be sent out prior to the planned start of the window.
|-
|- style="font-size:90%"
|'''06/04/2021 00:42 EDT'''
|'''File system event on /work_bgfs'''
At approximately 00:42 EDT this morning (Friday, June 4, 2021), systems administrators received a page that the management service of the BeeGFS file system was not responding.
Attempts were made to access the node hosting the service for 3 minutes. Unfortunately, the node locked up, requiring a reboot. From 00:45 until 00:50, any access to the /work_bgfs file system would have been interrupted, although other file system services were still operational.
A post-reboot inspection revealed that several BeeGFS threads had hung, which prevented other system daemons from operating nominally. In addition, system logs indicate that there were sporadic hiccups accessing the file system beginning at 00:05 EDT.
Systems administrators viewed the output of several running jobs and there appeared to be no issues. But, it is possible that some jobs may have been affected by the file system event. Please review all running job output and kill and re-submit any jobs which are no longer producing output. If there are any questions and/or comments, please email rc-help@usf.edu.
|-
|- style="font-size:90%"
|'''05/18/2021 11:47 EDT'''
|'''MDC datacenter maintenance 05/24/2021 at 08:00 EDT'''
Research Computing has been made aware of planned maintenance within the MDC datacenter.
The maintenance will entail the replacement of older UPS batteries with new batteries. During this scheduled work, the UPS will be placed into bypass mode. No outage is anticipated, but there is still a minimal risk associated with the maintenance. TECO has been contacted in advance so as to avoid any potential interruptions in service during the time frame in question.
The maintenance window is expected to last 4 hours, but facilities engineers expect the work to take about an hour. If there are any questions and/or comments, please email rc-help@usf.edu.
|-
|- style="font-size:90%"
04/08/2021 11:00 - 12:00 EDT IPA server upgrades

Today (04/08/2021) as of 11:56 EDT, Research Computing administrators upgraded our identity management servers, and its underlying supporting software.

The upgrade was considered high priority due to a recently discovered and infrequent issue where user accounts would not be resolved on a host, for up to a minute or longer. As a result, this would cause user jobs to be requeued and/or terminated by the scheduler, an inability to login, loss of additional UNIX group access, and "user unknown" errors.

Prior to the decision to upgrade the identity management servers, troubleshooting was performed with AD administrators to rule out any underlying issues with the domain controllers, including inspection of the network. Unfortunately, the logs did not reveal an explicit source of the issue.

The upgrade appears to have been successful, and logging has improved. Administrators will continue to monitor the situation to ensure that there are no lingering issues present.

If there are any questions and/or comments, please email rc-help@usf.edu.

03/31/2021 15:31 EDT Abrupt job terminations March 31, 2021 at 15:31 EDT

Today, during a routine SLURM configuration update, an issue arose which prevented the SLURM controller from re-reading its configuration file. Normally, a service restart corrects this kind of issue.

Unfortunately, starting from 15:15 EDT this wasn't possible. The issue was tracked down to a stuck job in the database which could not be removed, since removing it required the controller to be operational. As a result, the decision was made to start SLURM without its last checkpoint database, resulting in the termination of all running jobs.

Given the state of affairs, some user applications will continue to run and produce output until all "orphaned" processes are reaped via administrative scripts over the course of the next hour.

The only recourse will be for users to re-submit their jobs.

If there are any questions and/or comments, please email rc-help@usf.edu.

03/26/2021 05:11 EDT RRA file system issue Friday March 26, 2021 3:41 - 4:58 EDT

This morning at 03:41 EDT, Research Computing administrators began receiving notices from the storage controllers that comprise the RRA BeeGFS file system.

An investigation began immediately and, although the file system was online, any attempt to read and/or write would have resulted in an error. Exercising caution, administrators killed all running jobs on the RRA partitions (rra & rra_con2020), terminated all user sessions on the RRA login nodes, and unmounted the file system. In addition, the ability to log in to the RRA cluster was temporarily suspended.

The root cause of the issue was traced back to a storage controller detecting a potential hardware error. The error was resolved automatically, but in order to ensure that there was no data corruption the controller software placed several disk groups into a "protective" offline status.

As of 04:58 EDT, the errors with the storage controllers were resolved and the file system was remounted without issue. The ability to login was also restored to the RRA login nodes.

If there are any questions and/or comments, please email rc-help@usf.edu.

03/03/2021 08:48 EST Infiniband fabric issue MDC March 3, 2021 05:00 until 08:40 EST

This morning at 05:00 EST Research Computing administrators began receiving notices of sporadic packet loss on the QDR Infiniband fabric within the MDC data center.

The fabric in question supplies connectivity for the following CIRCE partitions:

  • bgfsqdr
  • devel
  • hchg

In addition, the same fabric supplies connectivity to the student cluster. The logs indicated that two SC login nodes were affected during the time frame in question.

The issue was tracked down to a misbehaving Infiniband switch. The switch had to be physically removed from the fabric and several switches had to be rebooted. As of 08:40 EST, the fabric is stable and all Infiniband network operations are nominal.

We ask that any users whose jobs were running on the aforementioned partitions please check their output files. Any failed or misbehaving jobs running during the time frame in question should be resubmitted.
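As an illustrative check (the time window below matches this incident; adjust as needed), jobs that ran on the affected partitions during the outage can be listed with sacct:

  # list your jobs on the affected partitions during the incident window, with their final state
  sacct -u $USER -r bgfsqdr,devel,hchg -S 2021-03-03T05:00 -E 2021-03-03T08:40 --format=JobID,JobName,Partition,State,Start,End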

If there are any questions and/or comments, please email rc-help@usf.edu.

03/01/2021 12:05 EST SLURM Upgrade Complete

As of Monday, March 1, 2021 at 12:05 EST, the SLURM upgrade from 16.05.10-2 to 20.11.3 is now complete.

Unfortunately, the upgrade process took longer than the anticipated 1 hour of downtime. However, this upgrade brings more dispatch options for users, better GPU support, and a wealth of bug fixes. In addition, the /apps mount point on the MDC-based nodes was changed from serial NFS to parallel BeeGFS.

At this time users are free to resume standard cluster operations.

If there are any questions and/or comments, please email rc-help@usf.edu.

02/08/2021 12:46 EST SLURM Upgrade

Research Computing is pleased to announce the planned upgrade of SLURM on CIRCE (16.05.10-2 to the latest, 20.11.3) on March 1, 2021 at 10:00 EST. We expect the downtime to be no longer than 1 hour.

This release brings significant administrative additions to scheduling parameters, as well as additional GPU submission options.

What does this mean for users post upgrade? For the most part, users will not notice anything except for job ID's being very low (< 500). All production submission scripts, QOS'es, partition names, etc., will function as expected. Also, the documentation for salloc/sbatch/srun available on https://slurm.schedmd.com will now correspond to the version running on CIRCE.
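As a quick, illustrative check once the upgrade is complete, the installed SLURM version can be confirmed from any login node:

  # should report 20.11.3 after the upgrade
  sinfo --version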

Unfortunately, because the two versions differ by more than two major releases, all running jobs will need to be terminated, as their state information will not be recognized by SLURM 20.11.3.

12/04/2020 22:00 EDT RRA file system hardware upgrade

Research Computing administrators have completed the planned upgrade of the underlying hardware providing access to the RRA file system.

The file system hardware now consists of 5 Dell storage nodes and 2 Dell ME4 series SAS storage arrays utilizing mixed media (SSD and spinning disk). All of the Dell storage nodes are connected to two separate interconnects, Mellanox HDR and Intel OmniPath. In addition, the total usable space of the file system has grown from 161 TB to 349 TB.

This upgrade was necessary as the original RRA file system had been deployed across 4 Dell storage nodes and DDN Fibre Channel storage arrays which have since reached end of life.

10/13/2020 09:11 EDT Emergency GPFS maintenance

Research Computing administrators have been made aware of a disk firmware issue, which requires immediate emergency maintenance by our hardware vendor.

Unfortunately, this process will require a few hours of downtime on Monday, October 19, 2020 in order to update the firmware on the affected disks. Work is expected to begin at 10:00 AM EDT; all jobs will be canceled and users will need to save their work and log out [0], as no I/O can be present on the file system. Research Computing staff will be on site.

Because GPFS is affected [1], the CIRCE and SC clusters, CIFS access, and login node access will be effectively offline as the file system will be unmounted. The RRA cluster _will not_ be affected, as it is not connected to GPFS.

Once the work is completed, Research Computing administrators will remount the file system on the affected systems and will send out a notice.

If there are any questions and/or comments, please email rc-help@usf.edu.

[0] If users are still logged in, their sessions will be terminated.
[1] /home, /shares, /apps, and /work

10/02/2020 09:15 EDT Final reminder: Job disruptions in MDC due to power maintenance

Facilities engineers have scheduled the maintenance to occur this Monday, October 5, 2020 beginning at 08:00 EDT.

Research Computing has been made aware of required PDU maintenance within the MDC data center. The maintenance window is scheduled to begin at approximately 08:00 EDT on Monday, October 5, 2020.

Unfortunately, this maintenance window requires one of the 3 PDU's within the data center to be shut down for the duration of the work. The PDU in question supplies 50% of the power to all R.C. assets, both infrastructure and computational. Under normal circumstances (power blip, etc.), redundancy would be provided by another PDU. However, the extended downtime could lead to unacceptable spikes in power draw on a single PDU.

Therefore, in order to ensure that there will be no unforeseen issues related to power, a decision has been made to kill all running jobs 2-3 hours prior to the scheduled maintenance, and to place the following partitions into a down status until the maintenance is completed:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • henderson_itn18
  • rra
  • simmons_itn18
  • snsm_itn19
  • tfawcett

The maintenance window should not last for more than 4 hours.
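For reference, the status of any of these partitions can be checked with a standard SLURM query (the partition name shown is just one of those listed above):

  # the AVAIL column will show "down" for a partition that has been placed into a down status
  sinfo --partition=bgfsqdr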

Research Computing will also be moving to an announcement-only mailing list to communicate system notices to our current and active user base. The CWA will most likely be decommissioned, given that it is only used for communication purposes at this time.

The new mailing list will only be populated with USF email addresses to ensure that all notices are received.

09/26/2020 19:43 EDT BGFS Maintenance Window complete

As of Saturday, September 26, 2020 at 19:25 EDT, the /work_bgfs and /shares_bgfs file system has been placed back into production.

At this time, the following partitions are now re-enabled and are accepting jobs:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • henderson_itn18
  • rra
  • simmons_itn18
  • snsm_itn19
  • tfawcett

The CIRCE login nodes had to be rebooted due to a stuck kernel module that prevented the mounting of the /work_bgfs and /shares_bgfs file system. Users are now able to synchronize files back to /work_bgfs and/or /shares_bgfs.

The issues which arose after the file system upgrade were ultimately tracked down to misbehaving hardware, although the underlying cause of the hardware's behavior remains unknown. Once the hardware was replaced, testing performed by Research Computing staff confirmed that the issues were no longer present.

Some users may notice ephemeral latency on the file system over the next few days. This is to be expected as the file system caches warm up.

Please send any questions and/or comments to rc-help@usf.edu.

09/22/2020 15:41 EDT BGFS Maintenance Window Extension: https://cwa.rc.usf.edu/news/405

Due to issues encountered during the planned maintenance on BGFS, the maintenance window has been extended for an additional 24 hours.

Service is expected to be restored by 10 AM EDT on September 24, 2020, and RC administrators will provide updates as available.

09/18/2020 10:28 EDT This serves as the final reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and that doing so will require a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed up and is considered volatile storage [0]. Data on /shares_bgfs is not affected by the reformat, but it will be inaccessible for the duration of the work.

Research Computing administrators will shut down the /work_bgfs and /shares_bgfs file system on September 21, 2020 at 10:00 EDT. During this time, job submissions to the following partitions will be disabled:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • hchg
  • henderson_itn18
  • simmons_itn18
  • snsm_itn19
  • tfawcett

In addition, access to /work_bgfs and /shares_bgfs will be removed from the CIRCE login nodes and several other login/compute nodes within the cluster for the duration of the work.

Research Computing administrators are planning on a 48 hour downtime window. Therefore, we ask users to plan accordingly.

During the next 5 days, any and all critical data on /work_bgfs must be moved elsewhere, as the file system will be reformatted. As you know, data on /work_bgfs is considered volatile and isn't archived [0].
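As an illustrative sketch only (the source and destination paths are examples, not prescribed locations), data can be copied off the scratch space with rsync before the reformat:

  # copy a project directory from scratch to home; -a preserves attributes, -P shows progress and allows resuming
  rsync -aP /work_bgfs/$USER/my_project /home/$USER/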

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29

09/16/2020 9:06 EDT This serves as a reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and that doing so will require a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed up and is considered volatile storage [0]. Data on /shares_bgfs is not affected by the reformat, but it will be inaccessible for the duration of the work.

Research Computing administrators will shut down the /work_bgfs and /shares_bgfs file system on September 21, 2020 at 10:00 EDT. During this time, job submissions to the following partitions will be disabled:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • hchg
  • henderson_itn18
  • simmons_itn18
  • snsm_itn19
  • tfawcett

In addition, access to /work_bgfs and /shares_bgfs will be removed from the CIRCE login nodes and several other login/compute nodes within the cluster for the duration of the work.

Research Computing administrators are planning on a 48 hour downtime window. Therefore, we ask users to plan accordingly.

During the next 5 days, any and all critical data on /work_bgfs must be moved elsewhere, as the file system will be reformatted. As you know, data on /work_bgfs is considered volatile and isn't archived [0].

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29

09/10/2020 9:34 EDT This serves as a reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and that doing so will require a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed up and is considered volatile storage [0]. Data on /shares_bgfs is not affected by the reformat, but it will be inaccessible for the duration of the work.

Research Computing administrators will shut down the /work_bgfs and /shares_bgfs file system on September 21, 2020 at 10:00 EDT. During this time, job submissions to the following partitions will be disabled:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • hchg
  • henderson_itn18
  • simmons_itn18
  • snsm_itn19
  • tfawcett

In addition, access to /work_bgfs and /shares_bgfs will be removed from the CIRCE login nodes and several other login/compute nodes within the cluster for the duration of the work.

Research Computing administrators are planning on a 48 hour downtime window. Therefore, we ask users to plan accordingly.

During the next ~2 weeks, any and all critical data on /work_bgfs must be moved elsewhere, as the file system will be reformatted. As you know, data on /work_bgfs is considered volatile and isn't archived [0].

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29

09/04/2020 15:37 EDT On Wednesday, September 3, 2020 at approximately 08:41 EDT Research Computing administrators noticed a discrepancy in the file system and took administrative action to correct the issue.


During this time frame, at 09:01 EDT and 09:02 EDT, logs indicated a hiccup in metadata processing. Administrators attended to the issue and, at 10:04 EDT, metadata processing was reported as stable; a resync operation commenced and finished at 11:01 EDT without errors. At this time, the file system appeared to be operating nominally.

However, Research Computing monitoring software and file system logs indicated that several more metadata consistency issues were observed, and 3 resyncs were automatically attempted by the file system management software, all of which failed. Research Computing administrators intervened and did not observe any errors logged within the system. To ensure that the file system was in a clean state, a manual resync was initiated at approximately 15:28 EDT, which completed at 16:26 EDT with errors logged. At this point, action was taken across the cluster to ensure that no intensive I/O would be present on the file system. A decision was made to terminate all jobs and temporarily disable Samba/CIFS (Windows/Mac networked drives via \\cifs-pgs.rc.usf.edu) on some shares.

Unfortunately, another metadata processing issue was reported via system logs, resulting in an automatic resync process being initiated sometime around 17:16 EDT. The standard start messages were not present in the logs, which is itself concerning.

Due to this instability, Research Computing administrators contacted the vendor for assistance. Research Computing was then instructed to disable certain features of the file system that were causing the issue, in an effort to restore connectivity and access to user data. Disabling these features ensured that further issues would not recur. Per the vendor's recommendation, Research Computing administrators restored access to Samba/CIFS (Windows/Mac networked drives via \\cifs-pgs.rc.usf.edu) at 17:27 EDT, and access to /work_bgfs and /shares_bgfs via the computational cluster at 17:56 EDT.

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and that doing so will require a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed up and is considered volatile storage [0]. Data on /shares_bgfs is not affected by the reformat, but it will be inaccessible for the duration of the work.

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29