Status & Outages

Outages

Current

Upcoming

| When | Duration | What |
| --- | --- | --- |
| April 21st 2026 10:00-11:00 PDT | 1 hr (planned) | DTN nodes s3dfdtn.slac.stanford.edu, sdfdtn[001-006] will be rebooted during the maintenance window to apply security updates. This may interrupt currently-running transfers. Reboots will be done in batches to minimize disruption. |

Past

| When | Duration | What |
| --- | --- | --- |
| March 18th 2026 11:00-12:00 PDT | 1 hr (planned) | DNS maintenance for s3dflogin, s3dflogin-mfa, and s3dfdtn. |
| February 4th 2026 9:00-13:00 PST | (planned) | Shutdown of the Globus node “sdfdtn004” for a network card upgrade. |
| July 7th 2025 | 8 days (un)planned | The Stanford Facilities team needed to conduct an evaluation of the SRCF datacenter transformers. All S3DF services were unavailable. |
| Feb 6th 2025 | 17 hrs (planned) | An 800A breaker on the M2 Mechanical Substation had to be replaced. The entire substation was powered down, resulting in a significant loss of datacenter cooling. |
| Dec 26 2024 | 12 days (unplanned) | One of our core networking switches in the datacenter failed and had to be replaced. The fallout from this impacted other systems and services on S3DF. Staff worked through the night to stabilize the network devices and connections and to recover the storage subsystem. |
| Dec 10 2024 | (unplanned) | StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability). |
| Dec 3 2024 | 1 hr (planned) | Mandatory upgrade of the Slurm controller, the database, and the client components on all batch nodes, Kubernetes nodes, and interactive nodes. |
| Nov 18 2024 | 8 days (unplanned) | StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability). |
| Oct 21 2024 | 10 hrs (planned) | Upgrade to all S3DF Weka clusters. We do NOT anticipate service interruptions. |
| Oct 3 2024 | 1.5 hrs (unplanned) | Storage issue impacted home directory access and SSH logins. |
| Jul 10 2024 | 4 days (planned) | Urgent electrical maintenance required in the SRCF datacenter. |
| Jun 26 2023 | 5 days (planned) | Everything down due to power outage. |
| Jan 15 2023 | 2 days (unplanned) | Symptom: sdfdata hanging on several nodes. Fix: one Weka server rebooted. Underlying issue under investigation. |

Monitoring Dashboards

Grafana

Roadmap :id=roadmap

Please see our Technology Migration Timeline (select the TIMELINE tab).

Slurm Dashboard

sdf-slurm-summary