Recent News

04/08/2021 11:00 - 12:00 EDT: IPA server upgrades

Today (04/08/2021), as of 11:56 EDT, Research Computing administrators upgraded our identity management servers and their underlying supporting software.

The upgrade was considered high priority due to a recently discovered, infrequent issue in which user accounts would fail to resolve on a host for up to a minute or longer. As a result, user jobs could be requeued and/or terminated by the scheduler, logins could fail, access to additional UNIX groups could be lost, and "user unknown" errors could appear.
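
As an illustrative aside (these are standard name-service tools, not commands from the original notice), the symptom could be observed by checking whether an account resolves on a given host:

  # Check whether the account is visible to the name service; a failure here
  # corresponds to the "user unknown" errors described above:
  getent passwd $USER

  # Confirm that all expected UNIX groups are listed:
  id $USER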

Prior to the decision to upgrade the identity management servers, troubleshooting was performed with the AD administrators to rule out any underlying issues with the domain controllers, including inspection of the network. Unfortunately, the logs offered no clear indication of the issue's source.

The upgrade appears to have been successful, and logging has improved. Administrators will continue to monitor the situation to ensure that there are no lingering issues present.

If there are any questions and/or comments, please email rc-help@usf.edu.

Archived News

03/31/2021 15:31 EDT: Abrupt job terminations

Today, during a routine SLURM configuration update, an issue arose that prevented the SLURM controller from re-reading its configuration file. Normally, a service restart "corrects" the issue.
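
For context, a routine configuration reload on a SLURM controller generally looks like the following sketch (generic commands, not necessarily the exact steps taken here):

  # Ask the running controller to re-read slurm.conf:
  scontrol reconfigure

  # If the reload fails, restart the controller service:
  systemctl restart slurmctld

  # A cold start that discards saved state (e.g., slurmctld -c) terminates all
  # running jobs, which is why the recovery described below had that effect.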

Unfortunately, starting at 15:15 EDT this was no longer possible. The issue was tracked down to a stuck job in the database that could not be removed, since removing it required the controller to be operational. As a result, the decision was made to start SLURM without its last checkpoint database, resulting in the termination of all running jobs.

Given the state of affairs, some user applications will continue to run and produce output until all "orphaned" processes are reaped via administrative scripts over the course of the next hour.

The only recourse will be for users to re-submit their jobs.

If there are any questions and/or comments, please email rc-help@usf.edu.

03/26/2021 05:11 EDT: RRA file system issue (Friday, March 26, 2021, 03:41 - 04:58 EDT)

This morning at 03:41 EDT, Research Computing administrators began receiving notices from the storage controllers that comprise the RRA BeeGFS file system.

An investigation began immediately, and although the file system was online, any attempt to read and/or write would have resulted in an error. Exercising caution, administrators killed all running jobs on the RRA partitions (rra & rra_con2020), terminated all user sessions on the RRA login nodes, and unmounted the file system. In addition, the ability to log in to the RRA cluster was temporarily suspended.

The root cause of the issue was traced back to a storage controller detecting a potential hardware error. The error was resolved automatically, but in order to ensure that there was no data corruption the controller software placed several disk groups into a "protective" offline status.

As of 04:58 EDT, the errors with the storage controllers were resolved and the file system was remounted without issue. The ability to log in to the RRA login nodes was also restored.

If there are any questions and/or comments, please email rc-help@usf.edu.

03/03/2021 08:48 EST: InfiniBand fabric issue in MDC (March 3, 2021, 05:00 - 08:40 EST)

This morning at 05:00 EST, Research Computing administrators began receiving notices of sporadic packet loss on the QDR InfiniBand fabric within the MDC data center.

The fabric in question supplies connectivity for the following CIRCE partitions:

  • bgfsqdr
  • devel
  • hchg

In addition, the same fabric supplies connectivity to the student cluster. The logs indicated that two SC login nodes were affected during the time frame in question.

The issue was tracked down to a misbehaving InfiniBand switch. The switch had to be physically removed from the fabric, and several switches had to be rebooted. As of 08:40 EST, the fabric is stable and all InfiniBand network operations are nominal.

We ask that any users whose jobs were running on the aforementioned partitions please check their output files. Any failed or misbehaving jobs running during the time frame in question should be resubmitted.
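
As a hedged illustration (the accounting window below matches the incident times above; the script name is a placeholder), affected jobs can be identified with SLURM's accounting tools and then resubmitted:

  # List your jobs that ran during the incident window, with their final states:
  sacct -X -u $USER -S 2021-03-03T05:00 -E 2021-03-03T08:40 \
        --format=JobID,JobName,Partition,State,ExitCode

  # Resubmit any failed or misbehaving job with its original submission script:
  sbatch my_job.sh   # "my_job.sh" is a placeholder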

If there are any questions and/or comments, please email rc-help@usf.edu.

03/01/2021 12:05 EST: SLURM Upgrade Complete

As of Monday, March 1, 2021 at 12:05 EST, the SLURM upgrade from 16.05.10-2 to 20.11.3 is complete.

Unfortunately, the upgrade process took longer than the anticipated one hour of downtime. However, this upgrade brings more dispatch options for users, better GPU support, and a wealth of bug fixes. In addition, the /apps mount point on the MDC-based nodes was changed from serial NFS to parallel BeeGFS.
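
As an illustrative check (not part of the original notice), users on an MDC-based node can confirm the file system type now backing /apps:

  # Report the mount source and file system type for /apps
  # (expected to show beegfs rather than nfs after the change):
  findmnt -T /apps
  # or, equivalently:
  df -hT /apps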

At this time users are free to resume standard cluster operations.

If there are any questions and/or comments, please email rc-help@usf.edu.

02/08/2021 12:46 EST: SLURM Upgrade

Research Computing is pleased to announce the planned upgrade of SLURM on CIRCE (16.05.10-2 to the latest, 20.11.3) on March 1, 2021 at 10:00 EST. We expect the downtime to be no longer than 1 hour.

This release includes significant administrative additions to the scheduling parameters, as well as additional GPU submission options.

What does this mean for users post-upgrade? For the most part, users will not notice anything except that job IDs will be very low (< 500). All production submission scripts, QOSes, partition names, etc. will function as expected. Also, the documentation for salloc/sbatch/srun available at https://slurm.schedmd.com will match the behavior on CIRCE.
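
For illustration only (the partition, resource values, and script name below are placeholders, not site defaults), an existing submission script such as the following sketch should continue to work unchanged after the upgrade:

  #!/bin/bash
  #SBATCH --job-name=example
  #SBATCH --partition=devel          # placeholder; use your usual partition/QOS
  #SBATCH --ntasks=4
  #SBATCH --time=01:00:00
  #SBATCH --output=example-%j.out    # %j expands to the (now very low) job ID

  # Launch the application exactly as before the upgrade:
  srun ./my_app                      # placeholder executable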

Unfortunately, because the upgrade spans more than two major releases, all running jobs will need to be terminated, as their state information will not be recognized by SLURM 20.11.3.

12/04/2020 22:00 EDT: RRA file system hardware upgrade

Research Computing administrators have completed the planned upgrade of the underlying hardware providing access to the RRA file system.

The file system hardware now consists of 5 Dell storage nodes and 2 Dell ME4 series SAS storage arrays utilizing mixed media (SSD and spinning disk). All of the Dell storage nodes are connected to two separate interconnects, Mellanox HDR and Intel OmniPath. In addition, the total usable space of the file system has grown from 161 TB to 349 TB.

This upgrade was necessary because the original RRA file system had been deployed across 4 Dell storage nodes and DDN FC storage arrays that are now end-of-life.

10/13/2020 09:11 EDT: Emergency GPFS maintenance

Research Computing administrators have been made aware of a disk firmware issue, which requires immediate emergency maintenance by our hardware vendor.

Unfortunately, this process will require a few hours of downtime on Monday, October 19, 2020 in order to update the firmware on the affected disks. Work is expected to begin at 10:00 AM EDT; all jobs will be canceled, and users will need to save their work and log out [0], as no I/O can be present on the file system. Research Computing staff will be on site.

Because GPFS is affected [1], the CIRCE and SC clusters, CIFS access, and login node access will be effectively offline as the file system will be unmounted. The RRA cluster _will not_ be affected, as it is not connected to GPFS.

Once the work is completed, Research Computing administrators will remount the file system on the affected systems and will send out a notice.

If there are any questions and/or comments, please email rc-help@usf.edu.

[0] If users are still logged in, their sessions will be terminated.
[1] /home, /shares, /apps, and /work

10/02/2020 09:15 EDT: Final reminder: Job disruptions in MDC due to power maintenance

Facilities engineers have scheduled the maintenance to occur this Monday, October 5, 2020 beginning at 08:00 EDT.

Research Computing has been made aware of required PDU maintenance within the MDC data center. The maintenance window was originally scheduled to begin at approximately 08:00 EDT on July 23, 2020.

Unfortunately, this maintenance window requires one of the 3 PDUs within the data center to be shut down for the duration of the work. The PDU in question supplies 50% of the power to all R.C. assets, both infrastructure and computational. Under normal circumstances (a power blip, etc.), redundancy would be provided by another PDU. However, the extended downtime could lead to unacceptable spikes in power draw on a single PDU.

Therefore, in order to ensure that there will be no unforeseen issues related to power, a decision has been made to kill all running jobs 2-3 hours prior to the scheduled maintenance, and to place the following partitions into a down status until the maintenance is completed:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • henderson_itn18
  • rra
  • simmons_itn18
  • snsm_itn19
  • tfawcett
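
As an illustrative check (the partition names are taken from the list above), users can confirm a partition's state during the window with sinfo:

  # Show availability and node state for selected partitions
  # (expected to report "down" until the maintenance completes):
  sinfo -p devel,rra,bgfsqdr --format="%P %a %D %T"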

The maintenance window should not last for more than 4 hours.

Research Computing will also be moving to an announcement-only mailing list to communicate system notices to our current and active user base. The CWA will most likely be decommissioned, given that it is only used for communications purposes at this time.

The new mailing list will only be populated with USF email addresses to ensure that all notices are received.

09/26/2020 19:43 EDT: BGFS Maintenance Window complete

As of Saturday, September 26, 2020 at 19:25 EDT, the /work_bgfs and /shares_bgfs file system has been placed back into production.

At this time, the following partitions are now re-enabled and are accepting jobs:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • henderson_itn18
  • rra
  • simmons_itn18
  • snsm_itn19
  • tfawcett

The CIRCE login nodes had to be rebooted due to a stuck kernel module that prevented the mounting of the /work_bgfs and /shares_bgfs file system. Users are now able to synchronize files back to /work_bgfs and/or /shares_bgfs.

The issues that arose after the file system upgrade were tracked down to misbehaving hardware, though the reason for the misbehavior remains unknown. Once the hardware was replaced, testing performed by Research Computing staff confirmed that the issues were no longer present.

Some users may notice transient latency on the file system over the next few days. This is to be expected while the file system's caches warm up.

Please send any questions and/or comments to rc-help@usf.edu.

09/22/2020 15:41 EDT: BGFS Maintenance Window Extension: https://cwa.rc.usf.edu/news/405

Due to issues encountered during the planned maintenance on BGFS, the maintenance window has been extended for an additional 24 hours.

Service is expected to be restored by 10 AM EDT on September 24, 2020, and RC administrators will provide updates as available.

09/18/2020 10:28 EDT: This serves as the final reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and it will require a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed-up and is considered volatile storage [0]. This work does not affect /shares_bgfs, but during the reformat the data will be inaccessible.

Research Computing administrators will shut down the /work_bgfs and /shares_bgfs file system on September 21, 2020 at 10:00 EDT. During this time, job submissions to the following partitions will be disabled:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • hchg
  • henderson_itn18
  • simmons_itn18
  • snsm_itn19
  • tfawcett

In addition, access to /work_bgfs and /shares_bgfs will be removed from the CIRCE login nodes and several other login/compute nodes within the cluster for the duration of the work.

Research Computing administrators are planning on a 48 hour downtime window. Therefore, we ask users to plan accordingly.

During the next 5 days, any and all critical data on /work_bgfs must be moved elsewhere, as the file system will be reformatted. As you know, data on /work_bgfs is considered volatile and isn't archived [0].
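
For illustration only (the source and destination paths below are placeholders; choose whatever storage is appropriate and has sufficient quota), copying critical data off the scratch space before the reformat might look like:

  # Copy a project directory from volatile scratch to a home-directory backup
  # (adjust paths to your own layout):
  rsync -avh --progress /work_bgfs/$USER/my_project/ $HOME/my_project_backup/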

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29

09/16/2020 09:06 EDT: This serves as a reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and it will require a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed-up and is considered volatile storage [0]. This work does not affect /shares_bgfs, but during the reformat the data will be inaccessible.

Research Computing administrators will shut down the /work_bgfs and /shares_bgfs file system on September 21, 2020 at 10:00 EDT. During this time, job submissions to the following partitions will be disabled:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • hchg
  • henderson_itn18
  • simmons_itn18
  • snsm_itn19
  • tfawcett

In addition, access to /work_bgfs and /shares_bgfs will be removed from the CIRCE login nodes and several other login/compute nodes within the cluster for the duration of the work.

Research Computing administrators are planning on a 48 hour downtime window. Therefore, we ask users to plan accordingly.

During the next 5 days, any and all critical data on /work_bgfs must be moved elsewhere, as the file system will be reformatted. As you know, data on /work_bgfs is considered volatile and isn't archived [0].

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29

09/10/2020 09:34 EDT: This serves as a reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and it will require a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed-up and is considered volatile storage [0]. This work does not affect /shares_bgfs, but during the reformat the data will be inaccessible.

Research Computing administrators will shut down the /work_bgfs and /shares_bgfs file system on September 21, 2020 at 10:00 EDT. During this time, job submissions to the following partitions will be disabled:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • hchg
  • henderson_itn18
  • simmons_itn18
  • snsm_itn19
  • tfawcett

In addition, access to /work_bgfs and /shares_bgfs will be removed from the CIRCE login nodes and several other login/compute nodes within the cluster for the duration of the work.

Research Computing administrators are planning on a 48 hour downtime window. Therefore, we ask users to plan accordingly.

During the next ~2 weeks, any and all critical data on /work_bgfs must be moved elsewhere, as the file system will be reformatted. As you know, data on /work_bgfs is considered volatile and isn't archived [0].

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29

09/04/2020 15:37 EDT: On Wednesday, September 3, 2020, at approximately 08:41 EDT, Research Computing administrators noticed a discrepancy in the file system and took administrative action to correct the issue.

During this time frame, at 09:01 EDT and 09:02 EDT, logs indicated a hiccup in metadata processing. Administrators attended to the issue, and at 10:04 EDT metadata processing was reported as stable; a resync operation commenced and finished at 11:01 EDT without errors. At that point, the file system appeared to be operating under nominal circumstances.

However, Research Computing monitoring software and file system logs indicated that several more metadata consistency issues were observed, and three resyncs were automatically attempted by the file system management software, all of which failed. Research Computing administrators intervened and did not observe any errors logged within the system. To ensure that the file system was in a clean state, a manual resync was initiated at approximately 15:28 EDT and completed at 16:26 EDT with errors logged. At this point, action was taken across the cluster to ensure that no intensive I/O would be present on the file system. A decision was made to terminate all jobs and temporarily disable Samba/CIFS (Windows/Mac networked drives via \\cifs-pgs.rc.usf.edu) on some shares.

Unfortunately, another metadata processing issue was reported via the system logs, resulting in an automatic resync process being initiated sometime around 17:16 EDT. The standard start messages were not present in the logs, which is itself concerning.

Due to this instability, Research Computing administrators contacted the vendor for assistance. Research Computing was then instructed to disable certain features of the file system that were causing the issue, in an effort to restore connectivity and access to user data. Disabling these features ensured that further issues would not occur. Per the vendor's recommendation, Research Computing administrators restored access to Samba/CIFS (Windows/Mac networked drives via \\cifs-pgs.rc.usf.edu) at 17:27 EDT and access to /work_bgfs and /shares_bgfs via the computational cluster at 17:56 EDT.

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and it will require a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed-up and is considered volatile storage [0]. This work does not affect /shares_bgfs, but during the reformat the data will be inaccessible.

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29