A storage system that backs many core websites and services started failing at approximately 4:30 PM PST on Friday. The components that failed (hard drives and a RAID controller battery) have been replaced. However, Ceph, the distributed object storage technology we use to run databases, application servers, and file storage, requires a rebalance to heal. This healing process takes many hours, potentially up to 20, to distribute ~190 TB of data across 5 storage nodes (computers). The rebalance started around 1 PM PST, and once it completes, all systems will be back to normal.
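For a rough sense of scale, the figures above can be turned into a back-of-the-envelope throughput estimate. This is only a sketch: the function name is hypothetical, and it assumes the full ~190 TB moves within the 20-hour upper bound and is spread evenly over the 5 nodes, using decimal units (1 TB = 1000 GB). A real rebalance may move only part of the data, so the actual rates could be quite different.

```python
def required_throughput_gb_per_s(data_tb: float, hours: float) -> float:
    """Aggregate throughput in GB/s needed to move data_tb terabytes in the given hours.

    Assumes decimal units (1 TB = 1000 GB) and a constant transfer rate.
    """
    return data_tb * 1000 / (hours * 3600)


# Figures from the notice: ~190 TB, up to 20 hours, 5 storage nodes.
total = required_throughput_gb_per_s(190, 20)
per_node = total / 5  # assumption: the load is spread evenly across nodes

print(f"{total:.2f} GB/s aggregate, {per_node:.2f} GB/s per node")
# prints: 2.64 GB/s aggregate, 0.53 GB/s per node
```

Under these assumptions, each node would need to sustain roughly half a gigabyte per second for the whole window, which helps explain why the healing process is measured in hours rather than minutes.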
We’re currently migrating core services off the failed storage system to restore availability sooner. Commerce websites, including Logos.com, Verbum.com, Vyrso.com, and Noet.com, should be migrated within the next hour or two.
We’re taking the same steps with Proclaim to get it functioning in time for Sunday morning services.
Ceph and other centralized storage technologies are commonplace in data center deployments. They are essentially many-eggs-in-one-basket architectures that are vulnerable to large-scale failures. To reduce the impact of hardware failures, we’re researching decentralized storage options that minimize the dependence of core systems on any one component.
We’re also building a redundant data center deployment in south Seattle. This will enable rapid failover when a system-wide event occurs. Unfortunately, that second data center is several weeks to months from completion.
We’re very sorry for the inconvenience caused by this event, and we appreciate your continued patience and support as we work toward a resolution. Know that we value your business and are making every effort to improve system availability.