Apology for Server Failure

I am very sorry for the inconvenience to all of you.

Our team has been working day and night to fix this, and I'm trying (as much as I am able!) to resist digging into the how? and why? too much until after we are through the crisis and have all the systems restored.

Be assured, though, that we will thoroughly examine this incident and take from it a lot of lessons and make the necessary changes.

We understand that our products have changed from 'needing to sync once in a while' to 'really needing to connect', and we are re-architecting both our infrastructure and our applications to reflect that, and to be more robust against failure. Some months ago we added complete offline backup/restore capability to Proclaim, so that it can run and even move presentations between systems without the Internet. Our desktop and mobile apps should work offline, except for online-only features (which Logos 6 introduced more of), and books that aren't downloaded.

(I've seen more reports than I expected of issues with the mobile app; the mobile apps should use any downloaded books just fine when there's no Internet -- I read in our mobile apps on the plane last week (to ETS and AAR/SBL conferences, where I am now) without any issues. But I know that iOS in particular can make its own decision to delete non-user-created data, like our downloaded ebooks, to manage space on the device, and when we get through this we'll look carefully at the source of mobile issues to see if this was the cause of unexpected missing books (and if so, how we can prevent it), or if there's something else going on.)

We have recently made dramatic changes in our back-end systems, particularly in the area of storage management, in anticipation of Logos 6. We needed both more online storage (for increased user count, the huge databases of map titles for Atlas, the media-related features, etc.) and increased reliability, in light of the 'Internet-only' nature of many new features.

Our team has long desired to have a completely mirrored back-end infrastructure, in which every server and every database is live-mirrored on a hot-switchable system. This, of course, requires essentially a 100% increase in systems cost, which runs into the hundreds of thousands of dollars. And it's still not a perfect system, because while it protects against hardware failures, it exactly replicates data/software problems to the secondary machine.

For a long time I resisted this, because the cost was extraordinary and we had high confidence that any failure could be addressed in just a few hours with replacement hardware (kept on hand) and/or restoring from backups. We felt the risk of a few hours vs. the ongoing doubling of (already expensive) systems-cost wasn't worth it.

As our model has changed, though, we've realized that we need to be able to guarantee up-time and reliability, as many of you have correctly pointed out in forum posts this weekend. We also realized that our existing platform was prohibitively expensive to scale up to our future storage needs, and that our existing data center in Bellingham (professionally outfitted, but small by today's data center standards) wasn't enough to be our only solution.

So we designed a new architecture based on the latest trends in data center management, and using some new (to us) platforms. We also arranged for a secondary location south of Seattle (100 miles away from Bellingham) where we could run systems that we could quickly switch to. The secondary data center was set up, provisioned with hardware, and planned to be ready for the Logos 6 launch.

(Yes, we did discuss the idea of having the second data center even further away; a lot of discussion went into hosting it in Phoenix, near our Tempe office, so that any northwest catastrophe couldn't wipe out both centers. But there was some advantage to being able to have the team visit the second location in person at times, and it was hard to imagine a scenario in which Tukwila and Bellingham were both catastrophically destroyed and unrecoverable for a long period of time.)

All that remained was the high-bandwidth data link which would allow these two data centers to be completely and directly peered, without the same capacity constraints that were already an issue with being in the smaller Bellingham data center. We've been waiting for weeks for this data link to go live (it needs to be connected by a third party), and this weekend's failure happened before it did.

In theory, even without the secondary data center we should have been able to recover in hours. We employ technologies that should report drive and component failures, and which support hot-swapping of components, spares of which we keep on hand.

The report I have is that the tiered configuration of clusters -- clusters of drives represented as one drive to a node, clusters of nodes reported as one system to a higher-level system, etc. -- obscured a low-level error. A component at the bottom failed, and then two others, but the small failures were somehow obscured by the redundancy built into the system -- at the higher level the whole system was fine when the first small piece failed, and it didn't report the low-level failure. (This may be a mistake in our architecture, or in our configuration of the reporting software.) Then other components failed before we knew to replace the first failed component, and 'critical mass' was achieved.
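
As a rough illustration of how tiered redundancy can hide a low-level fault, here is a minimal Python sketch. It is not a model of Faithlife's actual storage system: the `Component` class, the `tolerated_failures` parameter, the `build_system` helper, and names such as `drive-0` are all hypothetical. It only shows the general pattern in which a coarse "is the system up?" check stays green after the first failure, while a separate walk of the tree still finds the failed part.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Component:
    """Hypothetical node in a tiered cluster (drive, array, node, system)."""
    name: str
    healthy: bool = True
    children: List["Component"] = field(default_factory=list)
    tolerated_failures: int = 0  # failed children this tier can absorb

    def is_up(self) -> bool:
        """Coarse status: True while redundancy still covers the failures."""
        if not self.children:
            return self.healthy
        failed = sum(1 for child in self.children if not child.is_up())
        return failed <= self.tolerated_failures

    def failed_leaves(self) -> List[str]:
        """Every failed leaf below this node, even if redundancy masks it."""
        if not self.children:
            return [] if self.healthy else [self.name]
        return [n for child in self.children for n in child.failed_leaves()]


def build_system() -> Tuple[Component, List[Component]]:
    """Assumed layout: four drives in one array that tolerates one failure."""
    drives = [Component(f"drive-{i}") for i in range(4)]
    array = Component("array-0", children=drives, tolerated_failures=1)
    node = Component("node-0", children=[array])
    system = Component("system", children=[node])
    return system, drives


if __name__ == "__main__":
    system, drives = build_system()

    drives[0].healthy = False
    # The top level still reports "up", although a drive has already failed.
    print(system.is_up(), system.failed_leaves())  # True ['drive-0']

    drives[1].healthy = False
    # A second failure exceeds the redundancy and the whole system goes down.
    print(system.is_up(), system.failed_leaves())  # False ['drive-0', 'drive-1']
```

In this sketch, monitoring that only polls `is_up()` at the top would see nothing wrong until the second drive fails. Alerting on a non-empty `failed_leaves()` result is one possible way to surface the degraded state while there is still time to swap the first component.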

This weekend the goal has been to get back up and running. Our team worked very hard to ensure that Proclaim was live for the majority of users. The fact that the Tukwila data center was only 100 miles away meant we were able to drive there and fetch the redundant equipment on a Saturday and start deploying new hardware quickly. Another team member drove to a Seattle computer store and bought every hard drive in stock so we could build new storage arrays. The team has been going on very little sleep and working hard to restore everything. I'm told there's no data loss -- just a deployment problem.

(There are some technical details I'm still fuzzy on here, but apparently another issue relates to the number and size of data objects we have, and the fact that it takes longer than we anticipated to move the huge amount of data we now manage around when you swap new storage components into the system.)

I'm providing a very high-level overview of the problem, and I am simplifying the description and providing it as I understand it right now. I could be describing it wrong, and once we know more about what happened we'll post a better explanation. (For example, I'm told that the storage system was back up relatively soon, but despite reporting as healthy it simply wasn't performing as it should. We don't yet know why.)

Ultimately the fault is mine.

Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down.

I am so sorry for inconveniencing so many of you this weekend, and I promise you that we will try to live out our company values (http://www.slideshare.net/BobPritchett/faithlife-corporate-culture-40614282) and turn these mistakes into learning experiences and return to delivering an Awesome experience and delighting our customers.

-- Bob

Comments

Frederick Schaffner

Bob,

Thank you so much for your continued work on Logos. Having been with Logos since the very first product I have been more than pleased with the quality of service and quality of software.

I trust you will soon find all the remedies needed to make this great product fully functional again.

Blessings,

Rick Schaffner

Dr. Nathan Parker

Thanks Bob for the report, and I wouldn't say you "let us down" at all. Your team has been working overtime to get things back online for us, and I was still able to access my critical books for school and my critical Bible study tools. I had to change a few study habits during the meantime, but nothing that wasn't simple enough to do, and I'm just thankful for all the hard work you've all put in. Keep up the great work! Hang in there!

Veli Voipio

https://community.logos.com/discussion/comment/667103#Comment_667103

Well, today my Internet connection is very slow, and it turned out that the biggest Internet operator in Finland has problems in the overseas connections. They found that an excavator had cut an optical cable to Sweden. This means that missing redundancy somewhere in the network can happen to anyone [:(]

Matt Swank

So basically we have been sold (and paid good money for) a cloud based system that didn't actually have proper cloud infrastructure. Interesting.

Having an international cloud business with two database centers only 100mi from each other is hardly robust. The US has had power blackouts for whole regions for multiple days. While it's definitely not common, it does happen. When this happens you are not just affecting a customer base in the US but around the world. If you are really going to sell this as an international cloud based system then you should have redundancy in other parts of the world. Based on this experience it seems like I'm paying for one thing and getting something else.

I do really like the Logos software. However, I am finding it hard to justify upgrading to Logos 6 or buying other books and commentaries when there isn't proper infrastructure. Which is really disappointing since I was looking forward to both the upgrade and adding a couple commentaries this Christmas.

Matthew C Jones

https://community.logos.com/discussion/comment/675425#Comment_675425

The US has had power blackouts for whole regions for multiple days.

I understand your dissatisfaction but hyperbole does not help your point. Days??

Matt Swank

https://community.logos.com/discussion/comment/675429#Comment_675429

A few years ago there was a power outage that covered the North East, parts of the Midwest and Ontario. It affected millions (literally not hyperbole) and for many of us it lasted 4 days.

Rosie Perera

https://community.logos.com/discussion/comment/675429#Comment_675429

The US has had power blackouts for whole regions for multiple days.

I understand your dissatisfaction but hyperbole does not help your point. Days??

It's not hyperbole. Where my cousins live in Connecticut they've had blackouts lasting up to a week, in the winter no less, twice in the past couple of years. Parts of Long Island were without power for a week or more after Hurricane Sandy. A major earthquake in the Pacific Northwest (the anticipated "Big One") could cause widespread power outages for days.

On the other hand, Faithlife's response to this data center failure was to redouble their efforts to set up redundancy in more than one geographic location. They were already on a path to do that before this failure happened, and I'm guessing they are working as hard as they can to have it finalized before something like this happens again. It would be nice to hear an update on that.

JT (alabama24)

https://community.logos.com/discussion/comment/675429#Comment_675429

The US has had power blackouts for whole regions for multiple days.

I understand your dissatisfaction but hyperbole does not help your point. Days??

What is hyperbole about that? It is true!

Matthew C Jones

https://community.logos.com/discussion/comment/675482#Comment_675482

The blackout I remember in the North-East did not last several days. (Did it?) And that scale of an outage is extremely rare.

JT (alabama24)

https://community.logos.com/discussion/comment/675488#Comment_675488

The blackout I remember in the North-East did not last several days. (Did it?) And that scale of an outage is extremely rare.

I don't know which "blackout" you remembered. [:P]

Rare? It depends upon what you mean by "rare." They probably happen in the US several times every year for various reasons.

Matt Swank

https://community.logos.com/discussion/comment/675488#Comment_675488

I lived it. Some had power in a day, some two days but there were still many who didn't have it for up to 4 days, including myself. This is why true cloud based systems need redundancy in different regions of at the very least the country and probably the world. I get that Logos isn't a massive company and that the cost of infrastructure is high. However, if you are going to sell yourself as a cloud based system then you need the appropriate infrastructure. If the cost is too high then Logos needs to either find solutions other than building their own data centers, or they need to not sell cloud based services around the world.

SteveF

https://community.logos.com/discussion/comment/675488#Comment_675488

Dear "Super"

[I have appreciated your posts throughout the years-back to the earlier non-forum version?]

But the summer of that Eastern Black-out, I was on my way eastward [from Alberta] towards my home in South Western Ontario. On hearing of the "blackout" I deliberately delayed my trip, taking extra time traveling through Minnesota, Wisconsin [and its "Dells"] and Illinois. Even then I still had to go out of my way [northwards of Detroit] to miss still blacked out areas.

And that scale of an outage is extremely rare.

[hopefully, I agree]

DMB

https://community.logos.com/discussion/comment/675494#Comment_675494

I vote for the backup servers to be here in AZ. We never experience power outages, simply because Phoenix would otherwise melt. Plus Phoenix (or was that phoenix's) are verified by the apostolic fathers to quickly resurrect anyway (a major prooftext for Jesus' own). So, no problem.

Mathew Haferkamp

https://community.logos.com/discussion/comment/675491#Comment_675491

Dear Matt, I can understand some concern about a blackout but most of what you do in the software does not require internet. I mean all of your books are on your computer, and you can still search. I would have to say 80% or more of what you do does not need internet. But I would like to hear what you would be missing out on.

Matthew C Jones

https://community.logos.com/discussion/comment/675491#Comment_675491

I lived it.

It must have been prior to 2112. I suffered a bit of memory loss that year. I will take your word on it. My apologies for challenging you.

However, if you are going to sell yourself as a cloud based system then you need the appropriate infrastructure. If the cost is too high then Logos needs to either find solutions other than building their own data centers, or they need to not sell cloud based services around the world.

I am with you on this 100%.

Rosie Perera

https://community.logos.com/discussion/comment/675509#Comment_675509

It must have been prior to 2112. I suffered a bit of memory loss that year.

I seem to have suffered memory loss that year too, because I can't remember it at all. In fact I could swear it hasn't even happened yet. [;)]

Matt Swank

https://community.logos.com/discussion/comment/675501#Comment_675501

True most, if not all, of your books are on your computer. However, many of us use mobile devices more than the computer. While we can have our books on our devices, many features will not work unless you are connected to the internet. It's also my understanding that many features in Logos 6 are also dependent on cloud services. Then there's proclaim. It is almost fully dependent on cloud services. Yes they have introduced a work around but that defeats the primary reason for using Proclaim. It is sold as a cloud based service. Without that function you may as well use one of the many other options out there for worship presentation. The other options that do not have cloud based service do not have a monthly subscription.

On a side note many people lost access to their books on their mobile devices.

Matt Swank

https://community.logos.com/discussion/comment/675509#Comment_675509

Yah It was before 2012. It's always good to be challenged. [:)]

Matthew C Jones

https://community.logos.com/discussion/comment/675517#Comment_675517

it hasn't even happened yet.

Reminds me of the Moody Blues' "Days of Future Passed."

Mathew Haferkamp

https://community.logos.com/discussion/comment/675518#Comment_675518

Well I am not talking about proclaim but I do understand on mobile devices, if you haven't downloaded it onto your device. But you still fail to list any important features to you that you would be missing out on. I have been using logos for 7 or 8 years now and until the last year I used it most of the time without an internet connection. Sure you are going to miss out on some features but I had no trouble doing my reading and research for bible study, or sermon prep. And unless I missed out on it I can't recall another time when the servers were down for an extended amount of time.

Matt Swank

https://community.logos.com/discussion/comment/675523#Comment_675523

First of all, logos isn't just logos desktop bible software anymore. When that's all that they were, there was no real issue with a server crash.

Secondly, I personally rarely use my computer. I use my ipad. The reason I invested in Logos is because of what I could do on my ipad with it. You can't do word search or use other tools like your greek and hebrew tools without internet connection. And for many people, even though they downloaded their books to their device they were unable to use them.

As for Proclaim. That's a big deal. It's part of the Logos company and it is one of the services that basically quit working. When that is down it affects your entire worship service.

Donnie Hale

https://community.logos.com/discussion/comment/675509#Comment_675509

2112

By then we'll be more concerned with the priests of the temples of Syrinx. (possibly my favorite album of all time)

-Donnie

JoshInRI

Merry Christmas Mr. Pritchett...YOU personally may be the reason I retain version 7.0.

Glory to God!

Steve Maling

https://community.logos.com/discussion/comment/675616#Comment_675616

Regarding mobile devices and server failure: my solution is a tablet running Windows 8.1 so I can have Logos 6 offline. In my case, an Asus Vivo Tab Note 8 with 64 GB. The available GB may not be enough for those of you with massive libraries. For the rest of us, this solution answers the frequently posted frustrations with functions absent from the iPad (and I suppose from the Android).

abondservant

https://community.logos.com/discussion/comment/675616#Comment_675616

In 2004, we had a series of hurricanes hitting just weeks apart that left us without power for two, two week increments. Lost power for two weeks, it came back up for 6 or 8 days, and then bam, back down again. In 1993, the blizzard of '93 (or where I lived in Florida - the no-name storm, and now called the storm of the century by many) came up almost out of nowhere with little warning, and did considerably more damage than anyone could have expected. Its effects were felt from Canada to South America, and its epicenter was about 3 miles north of where I lived at the time. We had six feet of water in our home, which was itself several feet off the ground. Coast guard boats traveled in our front yard, and over our dock. As we were swimming (!) to safety, we saw dolphins crossing the highway.

In NC where I live now, if it rains hard people forget how to drive and run over telephone poles, and all sorts of things. If it snows? forget it. Don't even leave the house; and don't plan on having power throughout the entirety of the snow fall either. 1/2 inch of snow and an entire state capital shuts down.

In Florida we were prepared for that - a couple generators, a couple weeks worth of food, and water. Propane to cook on, and so forth. The worst part was the heat.

There are plenty of examples like this people can point to as reasons for offsite back ups. There are so many eventualities though (giant super-volcano erupts killing millions, causing second ice age, asteroid strikes can cause extinction level event, and on and on and on), that they all can't be accounted for. What about low yield nuclear device being detonated over Colorado, from which the resulting EMP would leave America in the dark ages for quite some time (some estimates fall into the 10-20 year range). I guess my point in all this is that only so much can be done to protect and prepare for the unlikely. Sure logos could have a server farm in the countryside of every major continent on the planet. Wonder how much extra that would make our books cost?

If we listen to the alarmists amongst us life as we know it totters around the balance point back and forth between extinction for mankind and our relative safety. With threats coming from every vector, and with such scary news selling better than peaceful news; at some point we have to accept that there are some eventualities that just aren't feasible to predict, plan, or prepare for and roll with the punches when they come.

This minor though annoying event (why is this still one of the top posts?) from two weeks ago now (?) maybe three weeks(?) - probably could have been prevented. Bob apologized, had a new plan in the works (redundant data center), and in light of what they learned, they made some changes to their setup to reflect a better (if not implicitly best) practice.

If the worst part of our day was that we had to preach from memory, or paper notes - then praise God for His mercies. If you weren't hungry, dying, or watching someone you love go hungry or die then praise God for His mercies.

Kudos, Logos, for learning from this. Keep up the good work.

Lee

https://community.logos.com/discussion/comment/675624#Comment_675624

There is nothing alarmist about suggesting that a backup be sited more than 100 miles away, or possibly in another continent. One doesn't need to compare that suggestion with canards about the hungry and dying either. Why the smear and the pseudo-spiritual appeals to extreme?

Setting up a robust cloud structure is expensive and complicated. People have made billions doing this "simple" thing. Since F.L. has made cloud-based features a part of their product offering, they will have gone through the process of rationalization, cost-benefit analysis and planning, and will also take new circumstances into account.

I'm also sure that Bob or someone at F.L. will reveal future developments, in due time.