I am very sorry for the inconvenience this has caused all of you.
Our team has been working day and night to fix this, and I'm trying (as much as I am able!) to resist digging too deeply into the how and the why until we are through the crisis and have all the systems restored.
Be assured, though, that we will examine this incident thoroughly, take away every lesson we can, and make the necessary changes.
We understand that our products have changed from 'needing to sync once in a while' to 'really needing to connect', and we are re-architecting both our infrastructure and our applications to reflect that and to be more robust against failure. Some months ago we added complete offline backup/restore capability to Proclaim, so that it can run and even move presentations between systems without the Internet. Our desktop and mobile apps should work offline, except for online-only features (of which Logos 6 introduced more) and books that aren't downloaded.
(I've seen more reports than I expected of issues with the mobile app. The mobile apps should use any downloaded books just fine when there's no Internet -- I read in our mobile apps on the plane last week (flying to the ETS and AAR/SBL conferences, where I am now) without any issues. But I know that iOS in particular can decide on its own to delete non-user-created data, like our downloaded ebooks, to manage space on the device. When we get through this we'll look carefully at the source of the mobile issues to see whether that was the cause of unexpectedly missing books (and if so, how we can prevent it), or whether something else is going on.)
We have recently made dramatic changes in our back-end systems, particularly in the area of storage management, in anticipation of Logos 6. We needed both more online storage (for the increased user count, the huge databases of map tiles for Atlas, the media-related features, etc.) and increased reliability, in light of the 'Internet-only' nature of many new features.
Our team has long desired to have a completely mirrored back-end infrastructure, in which every server and every database is live-mirrored on a hot-switchable system. This, of course, requires essentially a 100% increase in systems cost, which runs into the hundreds of thousands of dollars. And it's still not a perfect system, because while it protects against hardware failures, it exactly replicates data/software problems to the secondary machine.
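To make that tradeoff concrete, here is a minimal sketch (purely illustrative -- the class and names are hypothetical, not our actual replication code) of why synchronous mirroring protects against hardware loss but not against bad data:

```python
# Hypothetical sketch: synchronous mirroring protects against hardware loss,
# not against bad data, because every write -- good or corrupt -- is copied.

class MirroredStore:
    def __init__(self):
        self.primary: dict[str, bytes] = {}
        self.secondary: dict[str, bytes] = {}   # hot standby

    def write(self, key: str, value: bytes) -> None:
        self.primary[key] = value
        self.secondary[key] = value             # mirrored immediately

store = MirroredStore()
store.write("record-1", b"good data")
store.write("record-1", b"corrupted by a software bug")
# If the primary host dies we can fail over instantly, but the corrupted
# record is already on the secondary too -- mirroring didn't save the data.
```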
For a long time I resisted the fully mirrored approach, because the cost was extraordinary and we had high confidence that any failure could be addressed in just a few hours with replacement hardware (kept on hand) and/or by restoring from backups. We felt the risk of a few hours of downtime wasn't worth the ongoing doubling of an already expensive systems cost.
As our model has changed, though, we've realized that we need to be able to guarantee uptime and reliability, as many of you have correctly pointed out in forum posts this weekend. We also realized that our existing platform was prohibitively expensive to scale up to our future storage needs, and that our existing data center in Bellingham (professionally outfitted, but small by today's data center standards) couldn't be our only solution.
So we designed a new architecture based on the latest trends in data center management, using some platforms that are new to us. We also arranged for a secondary location south of Seattle (100 miles from Bellingham) where we could run systems that we could quickly switch to. The secondary data center was set up, provisioned with hardware, and scheduled to be ready for the Logos 6 launch.
(Yes, we did discuss the idea of putting the second data center even farther away; a lot of discussion went into hosting it in Phoenix, near our Tempe office, so that no Northwest catastrophe could wipe out both centers. But there was some advantage to the team being able to visit the second location in person at times, and it was hard to imagine a scenario in which both Tukwila and Bellingham were catastrophically destroyed and unrecoverable for a long period of time.)
All that remained was the high-bandwidth data link that would allow these two data centers to be completely and directly peered, without the capacity constraints that were already an issue in the smaller Bellingham data center. We've been waiting for weeks for this data link to go live (it has to be connected by a third party), and this weekend's failure happened before it did.
In theory, even without the secondary data center we should have been able to recover in hours. We employ technologies that should report drive and component failures and that support hot-swapping of components, spares of which we keep on hand.
The report I have is that the tiered configuration of clusters -- clusters of drives represented as one drive to a node, clusters of nodes represented as one system to a higher-level system, etc. -- masked a low-level error. A component at the bottom failed, and then two others, but the small failures were hidden by the redundancy built into the system: at the higher level the whole system looked fine when the first small piece failed, and it didn't report the low-level failure. (This may be a mistake in our architecture, or in our configuration of the reporting software.) Then other components failed before we knew to replace the first one, and 'critical mass' was reached.
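To illustrate the kind of masking I'm describing, here is a hypothetical sketch (the names, numbers, and rollup logic are assumptions for illustration, not our actual monitoring code) of how a health rollup that only asks "can the cluster still serve data?" hides individual failures until it's almost too late:

```python
# Hypothetical sketch of how tiered health rollups can mask component failures.
# None of these names correspond to our actual systems; it's only an illustration.

from dataclasses import dataclass

@dataclass
class DriveCluster:
    total_drives: int
    failed_drives: int = 0
    tolerated_failures: int = 2   # e.g. double-parity: survives two failed drives

    def healthy(self) -> bool:
        # The rollup answers "can this cluster still serve data?"...
        return self.failed_drives <= self.tolerated_failures

def system_status(clusters: list[DriveCluster]) -> str:
    # ...so the top-level view only goes red once a cluster is past its
    # tolerance -- by which point there is no redundancy left to lose.
    return "OK" if all(c.healthy() for c in clusters) else "DEGRADED"

cluster = DriveCluster(total_drives=12)
for _ in range(3):                      # drives fail one by one
    cluster.failed_drives += 1
    print(cluster.failed_drives, system_status([cluster]))
# Prints OK, OK, then DEGRADED -- the first two failures never surface at the
# top level. A rollup that reported any failed component, not just "still
# serving data", would have flagged the problem while spares could still help.
```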
This weekend the goal has been to get back up and running. Our team worked very hard to ensure that Proclaim was live for the majority of users. Because the Tukwila data center is only 100 miles away, we were able to drive down on a Saturday, fetch the redundant equipment, and start deploying new hardware quickly. Another team member drove to a Seattle computer store and bought every hard drive in stock so we could build new storage arrays. The team has been going on very little sleep and working hard to restore everything. I'm told there's no data loss -- just a deployment problem.
(There are some technical details I'm still fuzzy on here, but apparently another issue relates to the number and size of the data objects we have: it takes longer than we anticipated to move around the huge amount of data we now manage when new storage components are swapped into the system.)
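As a rough illustration of why this takes so long (the numbers below are made up for the example, not our actual figures), even a healthy sustained copy rate adds up to days when the data set is large:

```python
# Back-of-envelope rebuild time. The numbers are illustrative, not our real figures.

def rebuild_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours to copy `data_tb` terabytes at a sustained `throughput_mb_s` MB/s."""
    total_mb = data_tb * 1_000_000          # 1 TB = 1,000,000 MB (decimal units)
    return total_mb / throughput_mb_s / 3600

# Copying 100 TB at a sustained 500 MB/s takes over two days --
# and real rebuilds also compete with live traffic for that bandwidth.
print(f"{rebuild_hours(100, 500):.1f} hours")   # ~55.6 hours
```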
I'm providing a very high-level overview of the problem, simplified and described as I understand it right now. I could be describing it wrong, and once we know more about what happened we'll post a better explanation. (For example, I'm told that the storage system was back up relatively soon but, despite reporting itself healthy, simply wasn't performing as it should. We don't yet know why.)
Ultimately the fault is mine.
Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down.
I am so sorry for inconveniencing so many of you this weekend. I promise that we will try to live out our company values (http://www.slideshare.net/BobPritchett/faithlife-corporate-culture-40614282), turn these mistakes into learning experiences, and return to delivering an Awesome experience and delighting our customers.
-- Bob