Difference between revisions of "News"
|'''Date & Time'''
|'''Details'''
|- style="font-size:90%"
|'''10/02/2020 09:15 EDT'''
|'''Final reminder: Job disruptions in MDC due to power maintenance'''
'''Facilities engineers have scheduled the maintenance to occur this Monday, October 5, 2020 beginning at 08:00 EDT.'''
{{Quote|Research Computing has been made aware of required PDU maintenance within the MDC data center. The maintenance window is scheduled to begin at approximately 08:00 EDT on July 23, 2020.
Unfortunately, this maintenance window requires one of the 3 PDUs within the data center to be shut down for the duration of the work. The PDU in question supplies 50% of the power to all RC assets, both infrastructure and computational. Under normal circumstances (power blip, etc.), redundancy would be provided by another PDU. However, the extended downtime could lead to unacceptable spikes in power draw on a single PDU.
Therefore, to ensure that there will be no unforeseen issues related to power, a decision has been made to kill all running jobs 2-3 hours prior to the scheduled maintenance, and to place the following partitions into a down status until the maintenance is completed:
bfbsm_2019
bgfsqdr
cms_ocg
devel
henderson_itn18
rra
simmons_itn18
snsm_itn19
tfawcett
The maintenance window should not last for more than 4 hours.}}
Research Computing will also be moving system notices to an announcement-only mailing list for our current and active user base. The CWA will most likely be decommissioned given that it is only used for communications purposes at this time.
The new mailing list will only be populated with USF email addresses to ensure that all notices are received.
|-
|- style="font-size:90%"
|'''09/26/2020 19:43 EDT'''
|'''BGFS Maintenance Window complete'''
As of Saturday, September 26, 2020 at 19:25 EDT, the /work_bgfs and /shares_bgfs file systems have been placed back into production.
The following partitions are now re-enabled and are accepting jobs:
bfbsm_2019
bgfsqdr
cms_ocg
devel
henderson_itn18
rra
simmons_itn18
snsm_itn19
tfawcett
The CIRCE login nodes had to be rebooted due to a stuck kernel module that prevented the mounting of the /work_bgfs and /shares_bgfs file systems. Users are now able to synchronize files back to /work_bgfs and/or /shares_bgfs.
Due to reasons unknown, the issues which arose after the file system upgrade were tracked down to misbehaving hardware. Once the hardware was replaced, testing performed by Research Computing staff confirmed that the issues were no longer present.
Some users may notice ephemeral latency on the file system over the next few days. This is to be expected as the caches are warmed up on the file system.
Please send any questions and/or comments to rc-help@usf.edu.
|-
|- style="font-size:90%"
|'''09/22/2020 15:41 EDT'''
|'''BGFS Maintenance Window Extension: https://cwa.rc.usf.edu/news/405'''
Due to issues encountered during the planned maintenance on BGFS, the maintenance window has been extended for an additional 24 hours.
Service is expected to be restored by 10 AM EDT on September 24, 2020, and RC administrators will provide updates as available.
|-
|- style="font-size:90%"
|'''09/18/2020 10:28 EDT'''
Revision as of 13:15, 2 October 2020
Recent News
Date & Time | Details |
10/02/2020 09:15 EDT | Final reminder: Job disruptions in MDC due to power maintenance
Facilities engineers have scheduled the maintenance to occur this Monday, October 5, 2020 beginning at 08:00 EDT.
Research Computing will also be moving system notices to an announcement-only mailing list for our current and active user base. The CWA will most likely be decommissioned given that it is only used for communications purposes at this time. The new mailing list will only be populated with USF email addresses to ensure that all notices are received. |
09/26/2020 19:43 EDT | BGFS Maintenance Window complete
As of Saturday, September 26, 2020 at 19:25 EDT, the /work_bgfs and /shares_bgfs file systems have been placed back into production. The following partitions are now re-enabled and are accepting jobs: bfbsm_2019, bgfsqdr, cms_ocg, devel, henderson_itn18, rra, simmons_itn18, snsm_itn19, tfawcett.
The CIRCE login nodes had to be rebooted due to a stuck kernel module that prevented the mounting of the /work_bgfs and /shares_bgfs file systems. Users are now able to synchronize files back to /work_bgfs and/or /shares_bgfs.
Due to reasons unknown, the issues which arose after the file system upgrade were tracked down to misbehaving hardware. Once the hardware was replaced, testing performed by Research Computing staff confirmed that the issues were no longer present.
Some users may notice ephemeral latency on the file system over the next few days. This is to be expected as the caches are warmed up on the file system. Please send any questions and/or comments to rc-help@usf.edu. |
09/22/2020 15:41 EDT | BGFS Maintenance Window Extension: https://cwa.rc.usf.edu/news/405
Due to issues encountered during the planned maintenance on BGFS, the maintenance window has been extended for an additional 24 hours. Service is expected to be restored by 10 AM EDT on September 24, 2020, and RC administrators will provide updates as available. |
09/18/2020 10:28 EDT | This serves as the final reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400
|
09/16/2020 9:06 EDT | This serves as a reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400
|
09/10/2020 9:34 EDT | This serves as a reminder of our previously posted BGFS news: https://cwa.rc.usf.edu/news/400
|
09/04/2020 15:37 EDT | On Wednesday, September 3, 2020 at approximately 08:41 EDT Research Computing administrators noticed a discrepancy in the file system and took administrative action to correct the issue.
However, Research Computing monitoring software and file system logs indicate that several more metadata consistency issues were observed, and 3 resyncs automatically attempted by the file system management software failed. Research Computing administrators intervened and did not observe any errors logged within the system. To ensure that the file system was in a clean state, a manual resync was initiated at approximately 15:28 EDT and completed at 16:26 EDT with errors logged. At this point, action was taken across the cluster to ensure that no intensive I/O would be present on the file system. A decision was made to terminate all jobs and temporarily disable Samba/CIFS (Windows/Mac networked drives via \\cifs-pgs.rc.usf.edu) on some shares.
Unfortunately, another metadata processing issue was reported via system logs, resulting in an automatic resync process being initiated sometime around 17:16 EDT. The standard start messages weren't present in the logs, which is itself concerning. Due to this instability, Research Computing administrators contacted the vendor for assistance. Research Computing was then instructed to disable the file system features causing the issue, in an effort to restore connectivity and access to user data. Disabling these features ensured that further issues wouldn't occur.
Per the vendor's recommendation, Research Computing administrators restored access to Samba/CIFS (Windows/Mac networked drives via \\cifs-pgs.rc.usf.edu) at 17:27 EDT and access to /work_bgfs and /shares_bgfs via the computational cluster at 17:56 EDT.
The vendor has advised Research Computing that we must apply a patch to restore complete functionality, and it will require a reformat of all disks associated with the system. Research Computing will perform this work beginning September 21, 2020 at 10:00 EDT. This work will only affect the /work_bgfs scratch space.
As you know, this space is not backed up and is considered volatile storage [0]. This work does not affect /shares_bgfs, but during the reformat the data will be inaccessible. Any questions and/or comments can be sent to rc-help@usf.edu.
[0] https://wiki.rc.usf.edu/index.php/CIRCE_Data_Archiving#Work_Directory_.28.2Fwork_or_.2Fwork_bgfs.29 |