News

Revision as of 14:29, 18 September 2020 by Desantis

Recent News

Date & Time Details
09/18/2020 10:28 EDT This serves as the final reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and that the patch requires a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed up and is considered volatile storage [0]. Data on /shares_bgfs will not be affected, but it will be inaccessible during the reformat.

Research Computing administrators will shut down the /work_bgfs and /shares_bgfs file systems on September 21, 2020 at 10:00 EDT. During this time, job submissions to the following partitions will be disabled:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • hchg
  • henderson_itn18
  • simmons_itn18
  • snsm_itn19
  • tfawcett

In addition, access to /work_bgfs and /shares_bgfs will be removed from the CIRCE login nodes and several other login/compute nodes within the cluster for the duration of the work.

Research Computing administrators are planning on a 48-hour downtime window and ask users to plan accordingly.
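
As a rough planning aid, the partition list and the planned 48-hour window above can be expressed as a small check. This is only a sketch: the function name `submission_likely_disabled` is hypothetical, the end of the window is an assumption derived from the planned 48 hours (the actual downtime may be shorter or longer), and the scheduler itself remains the authority on whether a submission is accepted.

```python
from datetime import datetime, timedelta, timezone

# Partitions listed in the notice above.
AFFECTED_PARTITIONS = {
    "bfbsm_2019", "bgfsqdr", "cms_ocg", "devel", "hchg",
    "henderson_itn18", "simmons_itn18", "snsm_itn19", "tfawcett",
}

# EDT is UTC-4.
EDT = timezone(timedelta(hours=-4), "EDT")

# Start time taken from the notice.
WINDOW_START = datetime(2020, 9, 21, 10, 0, tzinfo=EDT)

# Assumption: the window ends after the planned 48 hours.
PLANNED_WINDOW_END = WINDOW_START + timedelta(hours=48)


def submission_likely_disabled(partition: str, when: datetime) -> bool:
    """Return True if `partition` is listed above and `when` falls inside
    the planned window. `when` must be timezone-aware."""
    return (
        partition in AFFECTED_PARTITIONS
        and WINDOW_START <= when < PLANNED_WINDOW_END
    )


# Example: a job planned for the morning of September 22 on "devel".
planned = datetime(2020, 9, 22, 9, 0, tzinfo=EDT)
print(submission_likely_disabled("devel", planned))
```

Jobs that need one of these partitions are best scheduled to finish before the window opens or to start after Research Computing announces that the work is complete.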

Before the work begins on September 21, 2020 at 10:00 EDT, all critical data on /work_bgfs must be moved elsewhere, as the file system will be reformatted. As you know, data on /work_bgfs is considered volatile and isn't archived [0].
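
One possible way to copy a directory out of /work_bgfs and confirm that every file arrived is sketched below. The helper name `copy_out_of_scratch` and the example paths in the comment are hypothetical, and the notice does not prescribe a destination or a tool, so treat this as an illustration rather than a recommended procedure. It requires Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def copy_out_of_scratch(source: Path, destination: Path) -> list:
    """Copy the tree at `source` into `destination`, then return the
    relative paths of any source files not found at the destination.
    An empty list means every regular file was copied."""
    # dirs_exist_ok=True allows re-running the copy into an existing directory.
    shutil.copytree(source, destination, dirs_exist_ok=True)

    missing = []
    for path in source.rglob("*"):
        if path.is_file():
            relative = path.relative_to(source)
            if not (destination / relative).is_file():
                missing.append(relative)
    return missing


# Hypothetical usage (both paths are placeholders, not real locations):
#   missing = copy_out_of_scratch(
#       Path("/work_bgfs/<user>/project"),
#       Path("/home/<user>/project_copy"),
#   )
#   print("missing files:", missing)
```

The check only confirms that each file exists at the destination; it does not compare contents. For large data sets, make sure the destination has enough quota before starting the copy.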

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29

09/16/2020 9:06 EDT This serves as a reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and that the patch requires a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed up and is considered volatile storage [0]. Data on /shares_bgfs will not be affected, but it will be inaccessible during the reformat.

Research Computing administrators will shut down the /work_bgfs and /shares_bgfs file systems on September 21, 2020 at 10:00 EDT. During this time, job submissions to the following partitions will be disabled:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • hchg
  • henderson_itn18
  • simmons_itn18
  • snsm_itn19
  • tfawcett

In addition, access to /work_bgfs and /shares_bgfs will be removed from the CIRCE login nodes and several other login/compute nodes within the cluster for the duration of the work.

Research Computing administrators are planning on a 48-hour downtime window and ask users to plan accordingly.

During the next 5 days, any and all critical data on /work_bgfs must be moved elsewhere, as the file system will be reformatted. As you know, data on /work_bgfs is considered volatile and isn't archived [0].

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29

09/10/2020 9:34 EDT This serves as a reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and that the patch requires a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed up and is considered volatile storage [0]. Data on /shares_bgfs will not be affected, but it will be inaccessible during the reformat.

Research Computing administrators will shut down the /work_bgfs and /shares_bgfs file systems on September 21, 2020 at 10:00 EDT. During this time, job submissions to the following partitions will be disabled:

  • bfbsm_2019
  • bgfsqdr
  • cms_ocg
  • devel
  • hchg
  • henderson_itn18
  • simmons_itn18
  • snsm_itn19
  • tfawcett

In addition, access to /work_bgfs and /shares_bgfs will be removed from the CIRCE login nodes and several other login/compute nodes within the cluster for the duration of the work.

Research Computing administrators are planning on a 48-hour downtime window and ask users to plan accordingly.

During the next ~2 weeks, any and all critical data on /work_bgfs must be moved elsewhere, as the file system will be reformatted. As you know, data on /work_bgfs is considered volatile and isn't archived [0].

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29

09/04/2020 15:37 EDT On Wednesday, September 3, 2020 at approximately 08:41 EDT Research Computing administrators noticed a discrepancy in the file system and took administrative action to correct the issue.


During this time frame, at 09:01 EDT and 09:02 EDT, logs indicated a hiccup in metadata processing. Administrators attended to the issue, and at 10:04 EDT metadata processing was reported as stable. A resync operation then commenced and finished at 11:01 EDT without errors. At this time, the file system appeared to be operating normally.

However, Research Computing monitoring software and file system logs indicate that several more metadata consistency issues were observed, and that the file system management software automatically attempted 3 resyncs, all of which failed. Research Computing administrators intervened and did not observe any errors logged within the system. To ensure that the file system was in a clean state, a manual resync was initiated at approximately 15:28 EDT, which completed at 16:26 EDT with errors logged. At this point, action was taken across the cluster to ensure that no intensive I/O would be present on the file system. A decision was made to terminate all jobs and temporarily disable Samba/CIFS (Windows/Mac networked drives via \\cifs-pgs.rc.usf.edu) on some shares.

Unfortunately, another metadata processing issue was reported via system logs, resulting in an automatic resync process being initiated sometime around 17:16 EDT. The standard start messages weren't present in the logs, which is itself concerning.

Due to this instability, Research Computing administrators contacted the vendor for assistance. Research Computing was then instructed to disable certain features of the file system that were causing the issue, in an effort to restore connectivity and access to user data. Disabling these features ensured that the issues would not recur. Per the vendor's recommendation, Research Computing administrators restored access to Samba/CIFS (Windows/Mac networked drives via \\cifs-pgs.rc.usf.edu) at 17:27 EDT and access to /work_bgfs and /shares_bgfs via the computational cluster at 17:56 EDT.

The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and that the patch requires a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space. As you know, this space is not backed up and is considered volatile storage [0]. Data on /shares_bgfs will not be affected, but it will be inaccessible during the reformat.

Any questions and/or comments can be sent to rc-help@usf.edu.

[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29