Data Center Storage Failure, 11/21 (Continued)

Jim Straatman

// 12:33 PM PST, 11/27

Email is back online and all data center systems are back in service.

// 11:14 AM PST, 11/27

Unspooling email from MXLogic.

// 10:51 AM PST, 11/27

Restored databases copied and mounted. Now testing.

// 12:50 AM PST, 11/27

The Exchange email server was able to mount the databases, but due to file corruption and quarantine issues, it is frequently detaching and dropping email offline. Microsoft Support recommends defragging, which will take ~10-15 hours per database, each in series. However, MXLogic spooling (where inbound email is queueing up) has a rolling 5-day window that cannot be adjusted, and waiting for defragging may put us beyond that limit, resulting in lost inbound mail from this week. (We started spooling mid-Monday.)
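As a rough check on that trade-off, the arithmetic can be sketched as below. The year, the noon start for "mid-Monday", and the count of three databases (taken from earlier updates) are assumptions for illustration; only the hour differences matter.

```python
from datetime import datetime, timedelta

YEAR = 2014  # assumption: the log gives no year; only time differences matter

# Assumption: "mid-Monday" is taken as noon on 11/24.
spool_start = datetime(YEAR, 11, 24, 12, 0)
now = datetime(YEAR, 11, 27, 0, 50)      # timestamp of this update
spool_window = timedelta(days=5)         # fixed MXLogic rolling window

databases = 3                            # email databases, defragged in series
defrag_hours_each = (10, 15)             # Microsoft's per-database estimate


def hours(delta: timedelta) -> float:
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


remaining = hours(spool_start + spool_window - now)
defrag_low = databases * defrag_hours_each[0]
defrag_high = databases * defrag_hours_each[1]

print(f"Spool window remaining: {remaining:.1f} h")
print(f"Defrag in series: {defrag_low}-{defrag_high} h")
print(f"Margin for everything else: {remaining - defrag_high:.1f} "
      f"to {remaining - defrag_low:.1f} h")
```

Under these assumptions the defrag alone would use roughly half to three quarters of the remaining window, before any mounting, testing, or unspooling.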

The best restore point we’ve got is from Friday at 3:00 PM PST. Email was functioning until about 4:30 PM that day, which is when the SATA array began to fail. Restoring from 3:00 PM means losing email between 3:00 and 4:30 PM. We also lose internal messages sent this evening since bringing email partially online. (We’ll keep the corrupted email files and can potentially recover content from the gap.)

To avoid hitting the spooling limit, we’re going to take the Friday restore point from 3 pm. It’s the quickest option that brings email back online. We’re copying restore files from the backup facility, which is writing much faster than attempts on the emergency server. We’ll get some sleep, then mount databases, turn on backups, enable replication to the secondary server, and disable spooling. If all goes well, we’ll be done.

// 9:28 PM PST, 11/26

Third database recovered and mounted. 2 quarantined, 0 corrupt.

// 7:36 PM PST, 11/26

Double correction. Approximately 20 mailboxes in quarantine and < 5 corrupted in first two databases. We'll be running jobs to repair mailboxes in both states. Copy job is nearly complete. We'll then mount and repair. Expecting to turn email up before midnight.

// 6:32 PM PST, 11/26

Correction, it appears there's no corruption, but some mailboxes are in quarantine. This is easily resolved, so we're instead identifying quarantined mailboxes and correcting. Still waiting on last email database to copy.

// 5:37 PM PST, 11/26

Some mailboxes suffered corruption and require a repair from the backup. I don’t expect this operation will encounter the same slowdown that the emergency backup recovery experienced. We’re writing all the necessary scripts to identify failed mailboxes and initiate repair commands.

The third email database will take about 2 hours to copy from the secondary Exchange server. We’ll then run the repair analysis once that database is mounted.

Once all email databases are mounted and repaired, we’ll disable spooling with MXLogic and can start receiving inbound email again.

If all goes to plan, we’ll be fully recovered from the storage failure.

// 5:09 PM PST, 11/26

Third rebuild job complete, though we had to run it on the secondary Exchange server. We'll have to copy the repaired database over to the primary before mounting.

// 4:14 PM PST, 11/26

Second database mounted, verifying data.

// 3:29 PM PST, 11/26

Second email rebuild job complete. Working on mounting the database.

// 2:51 PM PST, 11/26

98% and 95% complete on second and third email repair jobs.

// 1:45 PM PST, 11/26

Send to Kindle is back online.

// 1:10 PM PST, 11/26

95% and 92% complete on the second and third repair jobs. Initial reports are that email is fully intact, but we're continuing to investigate.

// 12:23 PM PST, 11/26

One of three email restore jobs completed and the Exchange server was able to mount the database (hooray!). Still need to evaluate the data to ensure it's intact. Remaining two jobs still running.

// 10:50 AM PST, 11/26

While recovering the Atlas database we discovered some corruption. Repairing the drives, and worst case scenario, we'll rebuild from source data.

// 10:18 AM PST, 11/26

Microsoft confirmed we're doing the right thing for recovering Exchange.

// 9:56 AM PST, 11/26

On a call with Microsoft Support, relaying everything we've done to recover email and ensure we're on the right track.

// 9:36 AM PST, 11/26

Resource deployment complete, currently testing Send to Kindle.

// 9:22 AM PST, 11/26

Exchange repair jobs across the 3 email databases are between 80% - 90% through the "repairing damaged tables" step. If these jobs finish, and report healthy databases, we will mount the files and bring email online.

All attempts to copy recovery data from the backup facility to the emergency Joyent Exchange server are bottlenecked to 5MB/second, which is prohibitively slow for recovery. We don't know the cause of that slowdown, and with everything else going on, haven't had a chance to resolve. With the exception of a few internal systems and Send to Kindle, the rest of the infrastructure is stable, so our top Operations engineers will be exclusively focused on getting email back online. (Other development teams will continue to focus on remaining services.)
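To put the 5MB/second bottleneck in perspective, here is a small sketch of the transfer-time arithmetic. The data sizes are hypothetical, since the actual Exchange database sizes are not stated in this update.

```python
def transfer_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours to move size_gb at a sustained rate, using 1 GB = 1000 MB."""
    return size_gb * 1000 / rate_mb_per_s / 3600


# Hypothetical sizes, for illustration only.
for size_gb in (100, 500, 1000):
    print(f"{size_gb:>5} GB at 5 MB/s: {transfer_hours(size_gb, 5):.1f} h")
```

At that rate, every 100 GB costs more than five hours of copy time.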

// 9:10 AM PST, 11/26

A summary from the night's events...

1:00 AM Interlinear docs sync restored (non customer-facing sync service)
1:13 AM Text Alignment sync is restoring (non customer-facing sync service)
1:30 AM Some sync statistics logging is restored. Sync up and down events are still not being logged to Touch Points because that service is still unavailable. Events are queueing on the messaging server until Touch Points comes back online.

At about 4 AM, one of our messaging servers stopped responding. The outage caused both Prefs and Prayer Lists sync to error for about 25% of requests. Proclaim presentations and Notes documents sync depend on this messaging server to update revision numbers for all members in a group when a document is updated. When the server is unavailable, changes will not automatically sync over to other connected user accounts until that user makes a change to that document or presentation. The outage caused several sync services to take anywhere from 10-30 seconds to respond to sync requests, and due to the amount of traffic the preferences endpoint receives, one node was overwhelmed and stopped opening connections to the database, causing sync errors. Sync services normally handle a messaging outage gracefully, but due to a bug in prayer lists sync, that service threw 500s instead.
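For readers curious what "handling a messaging outage gracefully" can look like, here is a minimal sketch. It is not our actual sync code: the `store` and `messaging` objects, the `MessagingUnavailable` exception, and the response shape are all hypothetical.

```python
import logging

logger = logging.getLogger("sync")


class MessagingUnavailable(Exception):
    """Raised by the (hypothetical) messaging client when the server is down."""


def save_and_notify(document, store, messaging):
    """Persist a change, then try to notify group members.

    A messaging failure is logged and reported to the caller, but it does
    not fail the sync request itself.
    """
    revision = store.save(document)  # the user's change is saved first
    try:
        messaging.publish({"doc_id": document["id"], "revision": revision})
        notified = True
    except MessagingUnavailable:
        # Degrade gracefully: other members pick the change up later.
        logger.warning("messaging unavailable; skipped notify for %s",
                       document["id"])
        notified = False
    return {"status": 200, "revision": revision, "notified": notified}
```

In a sketch like this, a service that lets the exception propagate would answer with a 500 instead. A short timeout on the publish call would also matter, given the 10-30 second responses described above.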

7:30 AM messaging server recovered, sync is behaving normally again on all sync services. Sync statistics logging has been disabled until Touch Points is available again to prevent another messaging outage. All sync statistics metadata (system information, not user data) from Friday until now has been lost. Logging this data will be re-enabled when Touch Points is available again.

Still in progress:
Copying over resources to support Send to Kindle. We expect the copy to be finished in the next few hours. After the copy finishes, we’ll do some testing, and then light the service up.

// 1:06 AM PST, 11/26

The Exchange clean up utilities are taking a long time to run, and are expected to take several more hours. We're going to get some rest, and get back to it in the morning. Email will not be back online in time for start of business day, but if the jobs finish overnight and the files come clean, we'll be able to attach and turn up email.

// 12:03 AM PST, 11/26

Api.Biblia.com is back online.

// 11:53 PM PST, 11/25

The Exchange server reported a dirty shutdown across 3 of the 5 storage groups. We're running clean up utilities and are optimistic they will resolve and come online sometime late tonight. We're continuing the restore process on the backup deployment just in case.

// 11:26 PM PST, 11/25

In the last hour, the following have come online...

Manuscript services which serves manuscript search results in the desktop
Wiki services which serves Wikipedia to the desktop
Books.Logos.com and Books API - in addition to the site, there's "Send to Logos Desktop" and Books.Logos.com search
Sermons.Logos.com and WBSA.Logos.com are their sites

// 10:13 PM PST, 11/25

We're considering the Ceph cluster fully healed and closed the case with Redhat. There are a few remaining services we're still turning on. Also, the corporate email server requires some cleanup, which we're working through now.

// 9:24 PM PST, 11/25

Visual Copy is working again.

// 9:20 PM PST, 11/25

Author and resource services are back online.

// 9:13 PM PST, 11/25

Amber is back online, backed by the recovered cluster, and Verse of the Day API is returning media. Working on Media API which will light up Visual Copy.

// 6:49 PM PST, 11/25

Ceph cluster is recovered and we're able to connect to the system. We're now working through drive cleanup (check disk), verifying data integrity and fixing bad sectors. Too early to celebrate, but solid progress.

// 6:06 PM PST, 11/25

Delete job complete, rebuild done. "This is good, this is very good."

// 5:25 PM PST, 11/25

Racked and added 3 compute hosts to Joyent SmartDataCenter including 16 TB of disk capacity, 576 GB of RAM, and 72 CPU cores.

// 5:21 PM PST, 11/25

Message delivery service API is back online. This service allows us to send messages and alerts to desktop and mobile apps.

// 5:00 PM PST, 11/25

Placement group delete job ~80% complete.

// 3:35 PM PST, 11/25

Got RAM?

// 3:30 PM PST, 11/25

First email restore job to local drive is complete! Data in flight from backup facility to data center, and started second restore job.

// 3:23 PM PST, 11/25

Team is staging high availability components for Joyent SmartDataCenter deployment. These additions will prevent a system outage in the event of a hardware failure on core management functions.

// 2:40 PM PST, 11/25

Inktank engineers have provided a new set of instructions for cleaning up placement group data. We're going through the process of making a placement group backup, so we have a copy of the current state. We'll then run some more intrusive delete commands to try and eliminate problematic data. Last, we'll run some rebuild jobs to clean everything up. If that works, we'll come back online.

First email restore job is near completion, should be done in the next 30 minutes.

Hardware for additional Joyent capacity has been acquired and is in transit to Bellingham.

// 1:45 PM PST, 11/25

For those interested, here's the hardware.

// 1:20 PM PST, 11/25

Email recovery is looking more promising. We've direct attached a drive to the backup server and are restoring locally. Once the backup has been extracted from Microsoft Data Protection Manager, we can copy from our backup facility to the data center using rsync, a more reliable file transfer protocol. There's about an hour remaining on the first extraction, followed by a copy. We'll do this operation 3 times, once for each email storage group.
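As an illustration of why rsync helps here, the sketch below builds an rsync invocation that keeps partial files so an interrupted copy can pick up where it left off. The paths and host name are hypothetical, and this is not the exact command we ran.

```python
def build_rsync_command(source: str, destination: str) -> list[str]:
    """Build an rsync invocation that can resume an interrupted copy."""
    return [
        "rsync",
        "--archive",    # preserve timestamps and permissions
        "--partial",    # keep partially transferred files for a later rerun
        "--progress",   # report per-file progress
        source,
        destination,
    ]


# Hypothetical paths and host name, for illustration only.
command = build_rsync_command(
    "/restore/storage-group-1/",
    "ops@datacenter.example:/exchange/storage-group-1/",
)
print(" ".join(command))
# To actually run the copy, pass the list to subprocess.run(command, check=True).
```

Rerunning the same command after a dropped connection reuses whatever already arrived, rather than starting the file over.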

// 11:55 AM PST, 11/25

wiki.logos.com, wiki.faithlife.com, and a number of other wikis came back online sometime late last night and were fully functional around 10 or 11 PM PST.

// 11:31 AM PST, 11/25

The Amber file restore is ~50% done and we estimate completion late this evening or early tomorrow morning, at which time Media API and Visual Copy will come back online.

The corporate email restore failed 10 hours into the process. Unfortunately, Microsoft Data Protection Manager only supports retry, not resume, so we initiated another restore last night around 11 PM PST. We’re also evaluating alternative methods for recovering the data, and will post progress as it’s known.

We’re sourcing 6 more servers from a variety of vendors today, and will be sending a team member to drive up and down I-5, from warehouse to warehouse, fetching essential hardware to expand the Joyent SmartDataCenter deployment. That equipment should be racked and provisioned late this afternoon or evening, providing the necessary capacity to continue migrating remaining APIs.

After 6 hours troubleshooting the failed storage cluster with a top level Inktank (Redhat) engineer, we were unable to bring the failed storage cluster back into service. The following description is our best understanding of the problem, but more analysis is required once the outage is over to fully understand what happened and insulate us from a repeat failure.

The failed Ceph cluster runs a diverse workload, including virtual operating systems, databases, email storage, and the file system that backs Amber. It’s configured with a replication factor of 3, meaning each file is broken up into smaller objects, which are then replicated 3 times across 6 available nodes.
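For those who want a picture of what replication factor 3 means, here is a deliberately simplified sketch. It splits a file into small objects and picks 3 of 6 nodes for each by ranking nodes on a hash. This is not Ceph's CRUSH algorithm and it skips placement groups entirely; the node names, object size, and file name are made up.

```python
import hashlib

NODES = [f"node{i}" for i in range(1, 7)]  # 6 storage nodes, names hypothetical
REPLICAS = 3                               # replication factor from the text
OBJECT_SIZE = 4                            # tiny object size, for illustration


def split_into_objects(name: str, data: bytes) -> dict[str, bytes]:
    """Break a file into fixed-size objects named <file>.<index>."""
    return {
        f"{name}.{i // OBJECT_SIZE}": data[i:i + OBJECT_SIZE]
        for i in range(0, len(data), OBJECT_SIZE)
    }


def place(object_name: str) -> list[str]:
    """Pick REPLICAS distinct nodes by ranking nodes on a per-object hash."""
    def score(node: str) -> str:
        return hashlib.sha256(f"{object_name}:{node}".encode()).hexdigest()
    return sorted(NODES, key=score)[:REPLICAS]


for obj in split_into_objects("mailbox.edb", b"0123456789"):
    print(obj, "->", place(obj))
```

Each object lands on three different nodes, so losing any single node still leaves two copies of everything it held.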

The success of Logos 6 has driven significant pressure on the system, though within acceptable tolerance of service. However, we speculate that increased load triggered a Linux kernel bug with XFS (a high-performance 64-bit journaling file system) that corrupted placement group data, which is the mechanism Ceph uses to map and replicate objects onto storage nodes. The Inktank engineer verified that we rebuilt placement groups correctly which brought the system to 0% degraded, but the system appears to be hanging onto corrupted placement groups, even though they have been deleted. Therefore, Ceph’s data integrity system still regards the cluster as failed, and has paused any data coming in or out. Data integrity is a valuable attribute of a storage system—a safety mechanism that prevents writes from corrupting a degraded system. Inktank now believes there is a bug in the Ceph data integrity system where it considers placement groups incomplete, preventing proper recovery and keeping the system in blocking mode, even though the system is recovered.

We’ve purchased a support contract with Inktank/Redhat this morning (not cheap) and will have a dedicated team of engineers working around the clock to resolve the bug. We expect they’ll produce a custom version of the code with the necessary patch to recover from blocking mode. There’s no telling exactly what’s required or how much time it will take, so we’ll simultaneously be working to extract the data from the storage pool, enabling migration to alternative storage systems.

While this is happening, we’ll also continue to exercise our (admittedly lacking) disaster recovery processes, restoring from backups. Whichever effort wins, recovering the failed array or restoring from backup, will determine how remaining services come back online.

I just want to take a moment and express how blessed I am to be working with such a talented team. Inktank has many enterprise customers, and according to them, we (and by we I mean Richard) are one of the most sophisticated outfits they’ve encountered. So much so, they are discounting our support contract by 15%, representing the value we’ll provide back into their product. Moreover, the Faithlife Operations team is one of the first in the world to deploy Joyent SmartDataCenter 7 under new open source licensing terms, all under duress and in a matter of hours.

I also want to thank you, the customer, both for your continued support and honest feedback. The Faithlife Operations team considers it a privilege to work on such awesome products that you consider essential, and we deeply appreciate your business.

// 10:49 PM PST, 11/24

Discovered fundamental bug in the Ceph placement group architecture & data integrity system. The team needs sleep so we'll be away until the morning. Will post more then.

// 9:21 PM PST, 11/24

We started a call with an Inktank engineer around 5 PM PST and are making significant progress recovering the Ceph storage cluster. If this works, systems will recover much faster, reducing pressure to migrate to alternative systems.

// 5:29 PM PST, 11/24

Custom dictionaries API is back online

// 5:14 PM PST, 11/24

Personal book builder upload API is back online.

// 4:10 PM PST, 11/24

User tagging API is back online.

// 3:37 PM PST, 11/24

The Ceph re-balance is complete and the SATA storage pool is benchmarking at 400 MB/sec, which is top performance for the system. However, writes are blocked for an unknown reason, probably a bug exposed by hardware failures and re-balancing. Ceph is an open source technology, and we’ve been active in the online community, but haven’t found an answer. We’re contracting Inktank, makers of Ceph, and will have one of their engineers on a call in the next few hours.

We’re continuing the parallel effort of restoring systems from backups. The corporate mail server is ~1/5th restored. We have 3 databases to restore and are currently working through the 1st. Once the 1st is complete, we’ll install the Exchange server, then start the second and third restores. There are still many hours until finished. (23-year-old companies have a lot of email.)

Amber, the asset manager that backs Media API/Visual Copy, is currently restoring to a new storage system, a ZFS zpool. Again, the restore process is taking a lot of time, and is through 200 of 900 GB. Once complete, that system will come back online.
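A back-of-the-envelope estimate for the Amber restore, assuming it began at the 6:02 AM update that announced it and continues at the same average rate. Both are assumptions, and the year is a placeholder since only time differences matter.

```python
from datetime import datetime, timedelta

YEAR = 2014  # assumption: the log gives no year; only time differences matter

# Assumption: the restore began at the 6:02 AM update that announced it.
started = datetime(YEAR, 11, 24, 6, 2)
checked = datetime(YEAR, 11, 24, 15, 37)  # timestamp of this update
done_gb, total_gb = 200, 900

elapsed_h = (checked - started).total_seconds() / 3600
rate_gb_per_h = done_gb / elapsed_h
remaining_h = (total_gb - done_gb) / rate_gb_per_h
eta = checked + timedelta(hours=remaining_h)

print(f"Average rate: {rate_gb_per_h:.1f} GB/h")
print(f"Remaining: {remaining_h:.1f} h, linear ETA {eta:%m/%d %I:%M %p}")
```

This is a straight-line extrapolation; restores rarely hold a constant rate, so treat it as a rough guide only.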

Most of the infrastructure provisioned for the Logos 6 launch runs on OpenStack virtualization backed by Ceph storage. OpenStack (think many operating systems running on a single computer, multiplied by many computers) has been sufficient, though not excellent. Ceph, however, is proving insufficient for our environment. The Operations team was already evaluating alternative architectures, and favoring Joyent’s SmartDataCenter for virtualization paired with ZFS, a file system and logical volume manager designed by Sun Microsystems. We’ve been prototyping Joyent since it was open sourced in early November and were planning to use it for the Telx, Tukwila redundant deployment.

Fortunately, the team has gained enough expertise in SmartDataCenter to deploy an emergency instance which has been the migration target for APIs and application servers. However, that effort has been held up today while we work through automated MySQL orchestration and some other deployment issues. (Sleep deprivation not helping.) That’s why many APIs are still offline. I expect we’ll see some breakthroughs in the coming hours and look forward to making “XYZ API is back online” announcements soon.

BusinessDesk order processing was fixed around 8:30 AM this morning.

// 6:38 AM PST, 11/24

BusinessDesk, the internal order processing system, is unable to add SKUs to quotes for phone orders. Online orders at Logos.com, etc. are functioning. We'll have the internal system resolved early morning.

// 6:02 AM PST, 11/24

Ceph re-balance not complete. Initiated backup restoration of corporate email and Amber.

Also initiating database restore for systems that support Digital Library Online and remaining APIs.

// 1:34 AM PST, 11/24

Currently provisioning additional storage and compute capacity to recover Amber (Visual Copy/Verse of the Day) and corporate email. Once those disk intensive systems are provisioned and restore processes started, we'll return to remaining APIs that require redeployment.

Ceph re-balance 99.6% complete, but moving incredibly slowly through final half percent.

// 11:24 PM PST, 11/23

The commerce engine came back online around 11 pm including Logos.com, Verbum.com, Vyrso.com, and Noet.com. Product browsing, purchasing, and account management are all functioning.

Device management and a few other smaller features are not yet working; we’ll get to those soon.

// 11:18 PM PST, 11/23

User inputs API is back online.

// 11:09 PM PST, 11/23

Word finds API is back online.

// 11:04 PM PST, 11/23

Shortcuts API is back online.

// 10:56 PM PST, 11/23

Sentence diagrams API is back online.

// 10:50 PM PST, 11/23

Self test API is back online.

// 10:46 PM PST, 11/23

Reading lists API is back online.

// 10:33 PM PST, 11/23

Personal book builder sync API is back online.

// 10:25 PM PST, 11/23

Passage lists API is back online.

// 10:20 PM PST, 11/23

Layouts API is back online.

// 10:17 PM PST, 11/23

History API is back online.

// 10:09 PM PST, 11/23

Highlights & handouts APIs are back online.

// 8:25 PM PST, 11/23

The storage system re-balance is running again and near completion. We expect it will recover sometime late this evening. The root cause of system degradation is still not known, though the hard drive and RAID battery failures are likely candidates. Once complete, we hope the system will come online and allow us to efficiently extract data and deploy across reliable and decentralized storage arrays currently being racked and provisioned at the data center.

Amber, the digital asset manager that backs Visual Copy, Verse of the Day, and a number of other media features is waiting on the new storage systems being racked. Same goes for our corporate email server. For both of these systems, recovering the failed storage system will expedite their availability; then we'll migrate behind the scenes. However, if recovery is not an option, these systems will be restored from a backup which takes longer.

The commerce engine is nearly online, pending an order fulfillment service that licenses resources after a purchase. Once that is deployed, we'll turn the sites back online.

BusinessDesk, our internal customer relationship management system, is mostly back online and should be functioning in time for business hours and phone support tomorrow.

Remaining APIs that support mobile, web, and desktop products are being worked on, but behind the commerce engine effort (for hardware and resource availability reasons).

// 7:35 PM PST, 11/23

Documents.Logos.com is back online.

// 6:55 PM PST, 11/23

Favorites API is back online.

// 6:48 PM PST, 11/23

Guides documents API is back online.

// 6:44 PM PST, 11/23

Code documents API is back online.

// 6:43 PM PST, 11/23

Copy Bible verses API is back online.

// 6:42 PM PST, 11/23

File storage API is back online. Faithlife.com documents tabs are now functioning.

// 6:18 PM PST, 11/23

Bibliography sync API is back online.

// 6:13 PM PST, 11/23

Clippings sync API is back online.

// 6:11 PM PST, 11/23

Reading progress API, which is responsible for the completion ring in the mobile library, is back online.

// 5:58 PM PST, 11/23

Docs API is back online, which supports the Faithlife.com documents tab and the Logos groups tool.

// 5:46 PM PST, 11/23

Prayer lists and reading plans APIs are back online.

// 5:37 PM PST, 11/23

Preference sync API is back online. This web service supports things like last position in a resource.

// 4:50 PM PST, 11/23

Note API is back online. Notes will now sync, and popular highlights are available.

// 2 PM PST, 11/23

The storage system recovery stalled at 60%. It does progress with heavy intervention (restarting subsystems), but it's moving very slowly. We are still attempting to bring it online, but are pivoting most efforts to alternative deployments. Client APIs (resource delivery, notes, highlights, favorites, maps, etc.) and the commerce engine (Logos.com, Verbum.com, etc.) are the top priorities. I'll update this thread as systems come online.

// ~10 AM PST, 11/23

The storage system that experienced component failure is still recovering, so many sites and services are still offline. We did recover Proclaim using an alternate hosting platform in time for the majority of users. We're currently migrating the commerce engine to the same deployment, but there are many more moving parts, so it's taking a bit longer.

APIs that back the Logos desktop, such as favorites, clippings, highlights, etc., are also a top priority this morning. We'd prefer to get those running on the recovering storage system, but may have to migrate depending on recovery progress.

// ~10 PM PST, 11/22

A storage system that backs many core websites and services started failing at approximately 4:30 PM PST on Friday. The components that failed (hard drives & a RAID controller battery) have been replaced. However, Ceph, the distributed object storage technology we use to run databases, application servers, and file storage, requires a re-balance to heal. This healing process takes many hours, potentially up to 20, to distribute ~190 TB of data across 5 storage nodes (computers). The re-balance started around 1 PM PST, and once the re-balance completes, all systems will be back to normal.
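For a sense of scale, this sketch works out the sustained rate that moving the full ~190 TB in 20 hours would imply. It is an upper bound under a loud assumption: a re-balance typically moves only the displaced portion of the data, not all of it, and the even split across nodes is a simplification.

```python
def implied_rate_mb_per_s(data_tb: float, hours: float) -> float:
    """Sustained rate needed to move data_tb in the given hours (1 TB = 1,000,000 MB)."""
    return data_tb * 1_000_000 / (hours * 3600)


total = implied_rate_mb_per_s(190, 20)  # figures from the update above
per_node = total / 5                    # assumed even spread over 5 storage nodes
print(f"Aggregate: {total:.0f} MB/s, per node: {per_node:.0f} MB/s")
```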

We’re currently migrating core services off the failed storage system to expedite system availability. Commerce websites including Logos.com, Verbum.com, Vyrso.com, Noet.com, etc. should migrate within the next hour or two.

We’re taking the same steps with Proclaim to get it functioning in time for Sunday morning service.

Ceph and other central storage technologies are commonplace in data center deployments. They are essentially many-eggs-in-one-basket architectures that are vulnerable to large scale failures. To reduce the impact of hardware failures, we’re researching decentralized storage options that minimize the dependence of core systems on any one component.

We’re also building a redundant data center deployment in south Seattle. This will enable rapid failover when a system-wide event occurs. Unfortunately, that second data center is several weeks/months from completion.

We’re very sorry for the inconvenience caused by this event, and appreciate your continued patience and support as we work to resolution. Know that we appreciate your business and are making every effort to improve system availability.