Details on Outage and Recovery of PLoS Journal Websites
UnitedLayer, the colocation facility that hosts our production servers, experienced an outage yesterday. From UnitedLayer: “A series of power brownouts occurred today at 2:56 PM PST due to PG&E instability related to the recent storms. Our 300KVA UPS system is not working as designed, the temporary repairs from last week did not hold. We anticipate a faulty motherboard.”
A number of our servers (all powered by the 300KVA UPS) lost power at that time. Our large disk array (2 TB of storage), which is the file server for both Fedora and Mulgara, had a boot failure and refused to power up. Russ went to the colo and restarted the disk array, which went into an automatic rebuild of the disks. The rebuild took about three hours to complete. Russ then started a program that checks the disks for consistency and repairs any problems it finds. This program was still running at 8 PM, and any recovery would have had to wait until it finished, which would have taken many more hours. To speed up recovery, we decided to stop the program, format the drives, and restore Fedora and Mulgara from a previous backup.
We estimated that it would take ~3.5 hours to restore Fedora from backup; it actually took ~5.5 hours. Once the restore was complete, Russ brought the systems back online at ~2:50 AM PST. Big thanks to Russ for babysitting the file server through the whole day and night and for bringing the system back up after the restore from backup completed.
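For a rough sense of scale, the times quoted above can be turned into a total downtime and a restore overrun. This is only a back-of-the-envelope sketch: the calendar dates are placeholders (the post gives only times of day), the variable names are illustrative, and the "~" figures are treated as exact.

```python
from datetime import datetime

# Placeholder dates (assumption): only the times of day come from the post.
power_loss = datetime(2000, 1, 1, 14, 56)   # 2:56 PM PST, per UnitedLayer's notice
back_online = datetime(2000, 1, 2, 2, 50)   # ~2:50 AM PST the following morning

# Total time between losing power and the systems coming back online.
downtime = back_online - power_loss
total_minutes = int(downtime.total_seconds() // 60)
hours, minutes = divmod(total_minutes, 60)

# Restore estimate versus actual duration, in hours, as quoted in the post.
estimated_restore_h = 3.5
actual_restore_h = 5.5
overrun_pct = (actual_restore_h - estimated_restore_h) / estimated_restore_h * 100

print(f"Approximate downtime: {hours} h {minutes} min")
print(f"Restore overran the estimate by about {overrun_pct:.0f}%")
```

Under these assumptions the outage lasted just under 12 hours, and the restore itself accounted for a little less than half of that.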
This is the first time that we have had a major hardware malfunction in the large disk array and the first time we have had to restore from a backup. While the disaster recovery plan worked, it took much longer than expected because of the size of the Fedora storage. We will look into solutions that enable quicker disaster recovery. We are also meeting with UnitedLayer to discuss mitigation options.