I am very sorry for the inconvenience this has caused all of you.
Our team has been working day and night to fix this, and I'm trying (as much as I am able!) to resist digging too deeply into the how and the why until we are through the crisis and have all the systems restored.
Be assured, though, that we will examine this incident thoroughly, take away every lesson we can, and make the necessary changes.
We understand that our products have changed from 'needing to sync once in a while' to 'really needing to connect', and we are re-architecting both our infrastructure and our applications to reflect that and to be more robust against failure. Some months ago we added complete offline backup/restore capability to Proclaim, so that it can run and even move presentations between systems without the Internet. Our desktop and mobile apps should work offline, except for online-only features (of which Logos 6 introduced more) and books that aren't downloaded.
(I've seen more reports than I expected of issues with the mobile app. The mobile apps should use any downloaded books just fine when there's no Internet -- I read in our mobile apps on the plane last week (flying to the ETS and AAR/SBL conferences, where I am now) without any issues. But I know that iOS in particular can decide on its own to delete non-user-created data, like our downloaded ebooks, to manage space on the device. When we get through this we'll look carefully at the source of the mobile issues to see whether that was the cause of unexpectedly missing books (and if so, how we can prevent it), or whether something else is going on.)
We have recently made dramatic changes in our back-end systems, particularly in the area of storage management, in anticipation of Logos 6. We needed both more online storage (for the increased user count, the huge databases of map tiles for Atlas, the media-related features, etc.) and increased reliability, in light of the 'Internet-only' nature of many new features.
Our team has long desired to have a completely mirrored back-end infrastructure, in which every server and every database is live-mirrored on a hot-switchable system. This, of course, requires essentially a 100% increase in systems cost, which runs into the hundreds of thousands of dollars. And it's still not a perfect system, because while it protects against hardware failures, it exactly replicates data/software problems to the secondary machine.
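To make that tradeoff concrete, here is a minimal sketch (purely illustrative -- the class and names are hypothetical, not our actual replication code) of why synchronous mirroring protects against hardware loss but not against bad data:

```python
# Hypothetical sketch: synchronous mirroring protects against hardware loss,
# not against bad data, because every write -- good or corrupt -- is copied.

class MirroredStore:
    def __init__(self):
        self.primary: dict[str, bytes] = {}
        self.secondary: dict[str, bytes] = {}   # hot standby

    def write(self, key: str, value: bytes) -> None:
        self.primary[key] = value
        self.secondary[key] = value             # mirrored immediately

store = MirroredStore()
store.write("record-1", b"good data")
store.write("record-1", b"corrupted by a software bug")
# If the primary host dies we can fail over instantly, but the corrupted
# record is already on the secondary too -- mirroring didn't save the data.
```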
For a long time I resisted the fully mirrored approach, because the cost was extraordinary and we had high confidence that any failure could be addressed in just a few hours with replacement hardware (kept on hand) and/or by restoring from backups. We felt the risk of a few hours of downtime wasn't worth the ongoing doubling of an already expensive systems cost.
As our model has changed, though, we've realized that we need to be able to guarantee uptime and reliability, as many of you have correctly pointed out in forum posts this weekend. We also realized that our existing platform was prohibitively expensive to scale up to our future storage needs, and that our existing data center in Bellingham (professionally outfitted, but small by today's data center standards) couldn't be our only solution.
So we designed a new architecture based on the latest trends in data center management, using some platforms that are new to us. We also arranged for a secondary location south of Seattle (100 miles from Bellingham) where we could run systems that we could quickly switch to. The secondary data center was set up, provisioned with hardware, and scheduled to be ready for the Logos 6 launch.
(Yes, we did discuss the idea of putting the second data center even farther away; a lot of discussion went into hosting it in Phoenix, near our Tempe office, so that no Northwest catastrophe could wipe out both centers. But there was some advantage to the team being able to visit the second location in person at times, and it was hard to imagine a scenario in which both Tukwila and Bellingham were catastrophically destroyed and unrecoverable for a long period of time.)
All that remained was the high-bandwidth data link that would allow these two data centers to be completely and directly peered, without the capacity constraints that were already an issue in the smaller Bellingham data center. We've been waiting for weeks for this data link to go live (it has to be connected by a third party), and this weekend's failure happened before it did.
In theory, even without the secondary data center we should have been able to recover in hours. We employ technologies that should report drive and component failures and that support hot-swapping of components, spares of which we keep on hand.
The report I have is that the tiered configuration of clusters -- clusters of drives represented as one drive to a node, clusters of nodes represented as one system to a higher-level system, etc. -- masked a low-level error. A component at the bottom failed, and then two others, but the small failures were hidden by the redundancy built into the system: at the higher level the whole system looked fine when the first small piece failed, and it didn't report the low-level failure. (This may be a mistake in our architecture, or in our configuration of the reporting software.) Then other components failed before we knew to replace the first one, and 'critical mass' was reached.
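To illustrate the kind of masking I'm describing, here is a hypothetical sketch (the names, numbers, and rollup logic are assumptions for illustration, not our actual monitoring code) of how a health rollup that only asks "can the cluster still serve data?" hides individual failures until it's almost too late:

```python
# Hypothetical sketch of how tiered health rollups can mask component failures.
# None of these names correspond to our actual systems; it's only an illustration.

from dataclasses import dataclass

@dataclass
class DriveCluster:
    total_drives: int
    failed_drives: int = 0
    tolerated_failures: int = 2   # e.g. double-parity: survives two failed drives

    def healthy(self) -> bool:
        # The rollup answers "can this cluster still serve data?"...
        return self.failed_drives <= self.tolerated_failures

def system_status(clusters: list[DriveCluster]) -> str:
    # ...so the top-level view only goes red once a cluster is past its
    # tolerance -- by which point there is no redundancy left to lose.
    return "OK" if all(c.healthy() for c in clusters) else "DEGRADED"

cluster = DriveCluster(total_drives=12)
for _ in range(3):                      # drives fail one by one
    cluster.failed_drives += 1
    print(cluster.failed_drives, system_status([cluster]))
# Prints OK, OK, then DEGRADED -- the first two failures never surface at the
# top level. A rollup that reported any failed component, not just "still
# serving data", would have flagged the problem while spares could still help.
```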
This weekend the goal has been to get back up and running. Our team worked very hard to ensure that Proclaim was live for the majority of users. Because the Tukwila data center is only 100 miles away, we were able to drive down on a Saturday, fetch the redundant equipment, and start deploying new hardware quickly. Another team member drove to a Seattle computer store and bought every hard drive in stock so we could build new storage arrays. The team has been going on very little sleep and working hard to restore everything. I'm told there's no data loss -- just a deployment problem.
(There are some technical details I'm still fuzzy on here, but apparently another issue relates to the number and size of the data objects we have: it takes longer than we anticipated to move around the huge amount of data we now manage when new storage components are swapped into the system.)
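As a rough illustration of why this takes so long (the numbers below are made up for the example, not our actual figures), even a healthy sustained copy rate adds up to days when the data set is large:

```python
# Back-of-envelope rebuild time. The numbers are illustrative, not our real figures.

def rebuild_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours to copy `data_tb` terabytes at a sustained `throughput_mb_s` MB/s."""
    total_mb = data_tb * 1_000_000          # 1 TB = 1,000,000 MB (decimal units)
    return total_mb / throughput_mb_s / 3600

# Copying 100 TB at a sustained 500 MB/s takes over two days --
# and real rebuilds also compete with live traffic for that bandwidth.
print(f"{rebuild_hours(100, 500):.1f} hours")   # ~55.6 hours
```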
I'm providing a very high-level overview of the problem, simplified and described as I understand it right now. I could be describing it wrong, and once we know more about what happened we'll post a better explanation. (For example, I'm told that the storage system was back up relatively soon but, despite reporting itself healthy, simply wasn't performing as it should. We don't yet know why.)
Ultimately the fault is mine.
Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down.
I am so sorry for inconveniencing so many of you this weekend. I promise that we will try to live out our company values (http://www.slideshare.net/BobPritchett/faithlife-corporate-culture-40614282), turn these mistakes into learning experiences, and return to delivering an Awesome experience and delighting our customers.
-- Bob