Apology for Server Failure

Bob Pritchett
Bob Pritchett Member, Logos Employee Posts: 2,280
edited November 2024 in English Forum

I am very sorry for the inconvenience to all of you.

Our team has been working day and night to fix this, and I'm trying (as much as I am able!) to resist digging into the how? and why? too much until after we are through the crisis and have all the systems restored.

Be assured, though, that we will thoroughly examine this incident and take from it a lot of lessons and make the necessary changes.

We understand that our products have changed from 'needing to sync once in a while' to 'really needing to connect', and we are re-architecting both our infrastructure and our applications to reflect that, and to be more robust against failure. Some months ago we added complete offline backup/restore capability to Proclaim, so that it can run and even move presentations between systems without the Internet. Our desktop and mobile apps should work offline, except for online-only features (which Logos 6 introduced more of), and books that aren't downloaded.

(I've seen more reports than I expected of issues with the mobile app; the mobile apps should use any downloaded books just fine when there's no Internet -- I read in our mobile apps on the plane last week (to ETS and AAR/SBL conferences, where I am now) without any issues. But I know that iOS in particular can make its own decision to delete non-user-created data, like our downloaded ebooks, to manage space on the device, and when we get through this we'll look carefully at the source of mobile issues to see if this was the cause of unexpected missing books (and if so, how we can prevent it), or if there's something else going on.)

We have recently made dramatic changes in our back-end systems, particularly in the area of storage management, in anticipation of Logos 6. We needed both more online storage (for increased user count, the huge databases of map titles for Atlas, the media-related features, etc.) and increased reliability, in light of the 'Internet-only' nature of many new features.

Our team has long desired to have a completely mirrored back-end infrastructure, in which every server and every database is live-mirrored on a hot-switchable system. This, of course, requires essentially a 100% increase in systems cost, which runs into the hundreds of thousands of dollars. And it's still not a perfect system, because while it protects against hardware failures, it exactly replicates data/software problems to the secondary machine.

For a long time I resisted this, because the cost was extraordinary and we had high confidence that any failure could be addressed in just a few hours with replacement hardware (kept on hand) and/or restoring from backups. We felt the risk of a few hours vs. the ongoing doubling of (already expensive) systems-cost wasn't worth it.

As our model has changed, though, we've realized that we need to be able to guarantee up-time and reliability, as many of you have correctly pointed out in forum posts this weekend. We also realized that our existing platform was prohibitively expensive to scale up to our future storage needs, and that our existing data center in Bellingham (professionally outfitted, but small by today's data center standards) wasn't enough to be our only solution.

So we designed a new architecture based on the latest trends in data center management, and using some new (to us) platforms. We also arranged for a secondary location south of Seattle (100 miles away from Bellingham) where we could run systems that we could quickly switch to. The secondary data center was set up, provisioned with hardware, and planned to be ready for the Logos 6 launch.

(Yes, we did discuss the idea of having the second data center even further away; a lot of discussion went into hosting it in Phoenix, near our Tempe office, so that any northwest catastrophe couldn't wipe out both centers. But there was some advantage to being able to have the team visit the second location in person at times, and it was hard to imagine a scenario in which Tukwila and Bellingham were both catastrophically destroyed and unrecoverable for a long period of time.)

All that remained was the high-bandwidth data link which would allow these two data centers to be completely and directly peered, without the same capacity constraints that were already an issue with being in the smaller Bellingham data center. We've been waiting for weeks for this data link to go live (it needs to be connected by a third party), and this weekend's failure happened before it did.

In theory, even without the secondary data center we should have been able to recover in hours. We employ technologies that should report drive and component failures, and which support hot-swapping of components, spares of which we keep on hand.

The report I have is that the tiered configuration of clusters -- clusters of drives represented as one drive to a node, clusters of nodes reported as one system to a higher-level system, etc. -- obscured a low-level error. A component at the bottom failed, and then two others, but the small failures were somehow obscured by the redundancy built into the system -- at the higher level the whole system was fine when the first small piece failed, and it didn't report the low-level failure. (This may be a mistake in our architecture, or in our configuration of the reporting software.) Then other components failed before we knew to replace the first failed component, and 'critical mass' was achieved.
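To illustrate the kind of masking described above, here is a minimal sketch in Python. It is not our actual storage stack: the `Component` class, the `tolerated_failures` setting, and the two-node layout are all hypothetical, chosen only to show how a tier that reports "up" whenever redundancy absorbs a failure can hide a dead drive until too many have failed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A drive (leaf) or a tier that presents its children as one unit."""
    name: str
    healthy: bool = True                      # only meaningful for leaves
    children: List["Component"] = field(default_factory=list)
    tolerated_failures: int = 0               # hypothetical redundancy budget

    def is_up(self) -> bool:
        # Naive roll-up: a tier is "up" while redundancy absorbs failures.
        if not self.children:
            return self.healthy
        failed = [c for c in self.children if not c.is_up()]
        return len(failed) <= self.tolerated_failures

    def failed_leaves(self) -> List[str]:
        # What monitoring should also surface: every failed drive,
        # even when the tiers above it still report "up".
        if not self.children:
            return [] if self.healthy else [self.name]
        names: List[str] = []
        for child in self.children:
            names.extend(child.failed_leaves())
        return names


def build_system() -> Component:
    # Assumed layout: 2 nodes x 4 drives; each node survives 1 dead drive,
    # and the top-level system survives 0 dead nodes.
    nodes = [
        Component(
            name=f"node{n}",
            children=[Component(name=f"node{n}-drive{d}") for d in range(4)],
            tolerated_failures=1,
        )
        for n in range(2)
    ]
    return Component(name="system", children=nodes, tolerated_failures=0)


if __name__ == "__main__":
    system = build_system()
    drives = system.children[0].children

    drives[0].healthy = False                 # first low-level failure
    print(system.is_up(), system.failed_leaves())

    drives[1].healthy = False                 # two more before anyone noticed
    drives[2].healthy = False
    print(system.is_up(), system.failed_leaves())
```

In this toy model the top-level check still reports "up" after the first drive dies, so an alert based only on that check would never fire; only a report that walks down to the leaves shows the failed drive while there is still time to swap it.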

This weekend the goal has been to get back up and running. Our team worked very hard to ensure that Proclaim was live for the majority of users. The fact that the Tukwila data center was only 100 miles away meant we were able to drive there and fetch the redundant equipment on a Saturday and start deploying new hardware quickly. Another team member drove to a Seattle computer store and bought every hard drive in stock so we could build new storage arrays. The team has been going on very little sleep and working hard to restore everything. I'm told there's no data loss -- just a deployment problem.

(There are some technical details I'm still fuzzy on here, but apparently another issue relates to the number and size of data objects we have, and the fact that it takes longer than we anticipated to move the huge amount of data we now manage around when you swap new storage components into the system.)

I'm providing a very high-level overview of the problem, and I am simplifying the description and providing it as I understand it right now. I could be describing it wrong, and once we know more about what happened we'll post a better explanation. (For example, I'm told that the storage system was back up relatively soon, but despite reporting health simply wasn't performing as it should. We don't yet know why.)

Ultimately the fault is mine.

Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down. 

I am so sorry for inconveniencing so many of you this weekend, and I promise you that we will try to live out our company values (http://www.slideshare.net/BobPritchett/faithlife-corporate-culture-40614282) and turn these mistakes into learning experiences and return to delivering an Awesome experience and delighting our customers.

-- Bob


Comments

  • David Taylor, Jr.
    David Taylor, Jr. Member Posts: 4,386 ✭✭✭

    Praying for you guys all weekend Bob.  And thank you for being willing to take the blame, that says a lot about you as a CEO and as a Person.

  • Erwin Stull, Sr.
    Erwin Stull, Sr. Member Posts: 2,793 ✭✭✭

    Thanks Bob, for the report.

    I knew that you were going to beat yourself up on this (being the CEO), but I want to encourage you to not do so (that's probably going in one ear and out the other).

    I believe that the majority of us Logos users understand that these things happen, and are sometimes catastrophic; however, there is a recovery, and you are moving in that direction.

    I, for one, will remain a Logos supporter.

    Be blessed.

  • Kenneth Neighoff
    Kenneth Neighoff Member Posts: 2,635 ✭✭✭

    Bob,

    Thank you for your remarks regarding the server failure.

  • Bruce Dunning
    Bruce Dunning MVP Posts: 11,161

    Bob, thank you for your transparency regarding this significant issue. I appreciate that you "own" this and am confident that you and your team will do what it takes to learn from this experience and continue to make future products even better. May the Lord give you the wisdom and strength you need for the days ahead.

    Using adventure and community to challenge young people to continually say "yes" to God

  • Rick
    Rick Member Posts: 2,018 ✭✭✭

    Bob, Faithlife is an awesome corporation and I still feel that way. Thanks for the great explanation and keep up the good work. I know that others were probably affected more than I was, but it felt kind of good to read from my paper Bible rather than the iPad for a couple of days.

  • toughski
    toughski Member Posts: 1,288 ✭✭✭
  • David Loo
    David Loo Member Posts: 37 ✭✭

    Thanks for the update. I work in IT myself, and I know full well the potential system and hardware failures that can happen. We do the best we can, and there's always something we can do better.

    Our church is poor (space rented from another church, no internet, etc.), so we don't run Proclaim (just Powerpoint from my own laptop). I always made sure everything can be run offline; I use a tablet for studies and reading, but I don't preach from it.

    Look forward to everything being back online again. Keep up the good work. My prayers are with you and all the IT people hard at work getting things back to normal.

    David

  • J David Shuttleworth
    J David Shuttleworth Member Posts: 94 ✭✭

    Thanks for the info. I have been using Logos since Libronix days. I do not remember any type of problem like this before. You all have outstanding customer service!! I would like to see people walk in love!! Does Christ condemn? I think not!! Thanks for all of your and your amazing staff's hard work!! God Bless!![Y][Y]

  • Lee
    Lee Member Posts: 2,714 ✭✭✭

    So we designed a new architecture based on the latest trends in data center management, and using some new (to us) platforms. We also arranged for a secondary location south of Seattle ...

    All that remained was the high-bandwidth data link which would allow these two data centers to be completely and directly peered, without the same capacity constraints that were already an issue  with being in the smaller Bellingham data center. We've been waiting for weeks for this data link to go live (it needs to be connected by a third party), and this weekend's failure happened before it did.

    I don't think you need to peer everything in real-time. There may be cost-saving alternatives.

  • Alan Kinder
    Alan Kinder Member Posts: 11 ✭✭

    If only life were perfect...

    I was a qualified reactor operator for the US Navy for many years.  Our systems were robust and redundant.  They were supposed to tell us what was happening...yet the systems were also extremely complex.  A simple failure could often lead to complete shutdowns...that were extremely difficult to diagnose and correct.

    In other words, life isn't perfect, and when you deal with large, complex systems, it will often bite you.  Thanks for all of the hard work to get it back up and running as quickly as possible.

  • mab
    mab Member Posts: 3,071 ✭✭✭

    We're blessed by your honest and hands-on management. It's clearly seen by the dedication of you and your staff.

    Thank you so much!

    Blessings

    The mind of man is the mill of God, not to grind chaff, but wheat. Thomas Manton | Study hard, for the well is deep, and our brains are shallow. Richard Baxter

  • Rich
    Rich Member Posts: 140 ✭✭

    Thanks Bob. Things happen and when bad things happen they provide opportunity to learn and grow. I appreciate everyone's hard work.

    Doer of Things

    Logos 7 Bronze

    13" MBA 1.7 GHz Intel Core i7, 8 RAM, 512 SSD

    27" iMAC 3.1 GHz Intel Core i5, 8 RAM, 1TB SATA Disk

    iPad Air 2, iPhone 6

  • Earl Sheneman
    Earl Sheneman Member Posts: 102 ✭✭

    The last seven years of my career with the Federal Aviation Administration I worked with automation systems in air traffic control. As important as air traffic control is we had systems failures there just like anywhere else. What we did was try and learn from it, did our very best to make sure it didn't happen again, and moved on.

    Thanks for being willing to take responsibility for this Bob but please don't beat yourself up too much over this. It is something to learn from and show you ways to make improvements to your system. Yes some people were inconvenienced but it is just something that happens with any technology company from time to time whether we are talking about Microsoft, Apple, or any other company.

    Thanks for being willing to come on the forums like you do, be upfront with your customers, and for building great Bible software!

  • Randy Lane
    Randy Lane Member Posts: 163 ✭✭

    Appreciate the report Bob.

    I can empathize with everything being a data center worker myself.

    Reminds me that Jesus said a little yeast leavens the whole loaf.

    Small failure sinks the whole ship.

  • Alexxy Olu
    Alexxy Olu Member Posts: 250 ✭✭

    Bob, thanks for speaking to us and being so transparent at this tough time. I appreciate your honesty and hard work and will continue to pray for all of you as I remain an ardent user of Logos. God bless you and your team as you continue your work of completely restoring the whole system.

  • Pedro
    Pedro Member Posts: 155 ✭✭
  • Dan Francis
    Dan Francis Member Posts: 5,336 ✭✭✭

    Thank you Bob for the explanation. As I said elsewhere, this was an inconvenience, not a disaster. Fortunately for me, other than a delay in getting a free book, Verbum 6 functioned perfectly (as I did not wish to make a visual copy or use the atlas, this blip could have gone unnoticed by me had I not been a user of the forums). I am glad you are getting things back up to 100% and going ahead with the strategy that should make this event a one-time occurrence. I know this incident just happened a few weeks too soon; Murphy's law works that way. I remember over a decade ago, I had my Mac backed up except my iTunes. I had bought the CD-ROMs to back them up, and even reorganized my music into 750 MB folders; the day before I was going to do it, my hard drive died. It was an extreme pain needing to redo all my MP3s, but I also made sure I got a bigger backup hard drive so I always have a few backups of my machine. God bless you and all your employees.

    -Dan

  • Veli Voipio
    Veli Voipio MVP Posts: 2,074

    David Loo said:

    Thanks for the update. I work in IT myself, and I know full well the potential system and hardware failures that can happen. We do the best we can, and there's always something we can do better. ..... Look forward to everything being back online again. Keep up the good work. My prayers are with you and all the IT people hard at work getting things back to normal.

    Similar thoughts, and I'll remain a customer

    Gold package, and original language material and ancient text material, SIL and UBS books, discourse Hebrew OT and Greek NT. PC with Windows 11

  • Dave Hooton
    Dave Hooton MVP Posts: 36,174

    I am very sorry for the inconvenience to all of you.

    That is much appreciated, together with the level of detail.

    Dave
    ===

    Windows 11 & Android 13

  • Cynthia Feenstra
    Cynthia Feenstra Member Posts: 10 ✭✭

    Hello Mr. Pritchett:

    99% of what you said above is so far over my head, but truth be told, the most important words I saw were "I apologize..."  Things happen, and the blessing in disguise here is that you never would have learned all you learned if it was not for what you just went through.  Isn't that the purpose of a trial. 

    "Consider it JOY my brethren when you encounter various trials, knowing that the testing of your faith produces endurance..."

    I too was praying for all the workers of Faithlife and am happy to see you back online.  Get some sleep now, and ask the "why" questions TOMORROW!!!

    Blessings,

    Cynthia Feenstra

  • Cynthia in Florida
    Cynthia in Florida Member Posts: 821 ✭✭

    Hello Mr. Pritchett:

    99% of what you said above is so far over my head, but truth be told, the most important words I saw were "I apologize..."  Things happen, and the blessing in disguise here is that you never would have learned all you learned if it was not for what you just went through.  Isn't that the purpose of a trial. 

    "Consider it JOY my brethren when you encounter various trials, knowing that the testing of your faith produces endurance..."

    I too was praying for all the workers of Faithlife and am happy to see you back online.  Get some sleep now, and ask the "why" questions TOMORROW!!!

    Blessings,

    Cynthia Feenstra

    Somehow I got signed in under a different email, one I don't even use here (but did, I guess, at the beginning of my account). Hmmm...anyway, I am re-posting this so you know who I am.

    Cynthia

    Romans 8:28-38

  • Mike Childs
    Mike Childs Member Posts: 3,135 ✭✭✭

    I have total confidence in Logos as a product and as a company.  The way they handle a crisis such as this simply reaffirms my confidence in them.  Thanks for all you do for us.


    "In all cases, the Church is to be judged by the Scripture, not the Scripture by the Church," John Wesley

  • Kevin Maples
    Kevin Maples Member Posts: 808 ✭✭

    I am so sorry for inconveniencing so many of you this weekend, and I promise you that we will try to live out our company values (http://www.slideshare.net/BobPritchett/faithlife-corporate-culture-40614282) and turn these mistakes into learning experiences and return to delivering an Awesome experience and delighting our customers.

     I think you are doing an outstanding job! When you are depending on technology things happen sometimes. I totally understand. I wish everyone else did. Keep up the great work! No complaints here about anything.  
  • Michael
    Michael Member Posts: 362 ✭✭

    Ultimately the fault is mine.

    Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down. 

    -- Bob

    Words of a true leader!

  • Peter Bongers
    Peter Bongers Member Posts: 46 ✭✭

    Thanks for the explanation and the apology... not sure you can take 100% of the blame though. That you have plans to protect against this happening again and that you have learned through this is encouraging. Thanks to everyone for working so hard to get the system back "up" and running! [Y]

  • GregW
    GregW Member Posts: 848 ✭✭

    I am very sorry for the inconvenience to all of you.

    That is much appreciated, together with the level of detail.

    Absolutely, and thank you Bob for the transparency and honesty in your reply. Nearly a year after one of the biggest banks in the UK had a massive failure that took out all its cash dispensers and payment systems, we still haven't had an explanation as clear and open as this one. 


    Running Logos 6 Platinum and Logos Now on Surface Pro 4, 8 GB RAM, 256GB SSD, i5

  • Jim Snowden
    Jim Snowden Member Posts: 193 ✭✭

    Thanks Bob for the detailed explanation and apology. 

  • Randall Hartman
    Randall Hartman Member Posts: 502 ✭✭

    This is one of the reasons I love Logos!!! For the record, I've never received an apology from any other CEO...

  • Erwin Stull, Sr.
    Erwin Stull, Sr. Member Posts: 2,793 ✭✭✭

    This is one of the reasons I love Logos!!! For the record, I've never received an apology from any other CEO...

    [Y]

    It is very, very rare (in the world) that you will see a CEO take personal ownership of an issue, and make a public apology, unless there is absolutely no choice (it is seen as a sign of weakness). Logos/Bob can place on paper all he/they want that Logos is strictly a business and nothing more, but I continue to see (and appreciate) a ministry/business. His ways (as seen) of conducting business, and his mannerism shows a very different picture than just business only.

    I am with Bob and Logos regardless of pretty much what negative things may occur.

  • Geo Philips
    Geo Philips Member Posts: 401 ✭✭

    Thanks for the explanation Bob.

    It is a testament to your character to provide an apology on such a public forum that allows us to react and write things that may not be in your control.

    Thanks to the team for working through the weekend. Hope everyone has a great Thanksgiving!

  • Graham Criddle
    Graham Criddle MVP Posts: 33,253

    I am very sorry for the inconvenience to all of you.

    Be assured, though, that we will thoroughly examine this incident and take from it a lot of lessons and make the necessary changes.

    Thanks Bob - appreciated

  • Joseph Turner
    Joseph Turner Member Posts: 2,872 ✭✭✭

    Thanks Bob!  I hope you guys are able to find a solution that works without having to spend a fortune.  

    Disclaimer:  I hate using messaging, texting, and email for real communication.  If anything that I type to you seems like anything other than humble and respectful, then I have not done a good job typing my thoughts.

  • James Hiddle
    James Hiddle Member Posts: 792 ✭✭

    This is one of the reasons I love Logos!!! For the record, I've never received an apology from any other CEO...

    Yes

    It is very, very rare (in the world) that you will see a CEO take personal ownership of an issue, and make a public apology, unless there is absolutely no choice (it is seen as a sign of weakness). Logos/Bob can place on paper all he/they want that Logos is strictly a business and nothing more, but I continue to see (and appreciate) a ministry/business. His ways (as seen) of conducting business, and his mannerism shows a very different picture than just business only.

    I am with Bob and Logos regardless of pretty much what negative things may occur.

    Agreed! In this corporate world you hardly ever see a CEO taking personal responsibility for his company when it comes to issues like this, and coming to an open forum to apologize for something that wasn't even his fault.

    Bob humbled himself in front of us and we should be thankful for that. God bless a CEO for doing that and God bless Bob always.

  • Ed Stone
    Ed Stone Member Posts: 82 ✭✭

    Thank you for the update!  Your hands on approach is yet another reason (among many others) why I continue to be a loyal long-time Logos customer! Praying for you all!

    Ed

  • Sounds like it has been a very demanding weekend for you, Bob, and your team.  You are in prayer.  We are grateful for your ministry -

  • Tes
    Tes Member Posts: 4,035 ✭✭✭

    God bless you Logos. May the Good Lord guide and give you  more capability in handling all the wisdom you need to apply in your activities.

    Blessings in Christ.

  • abondservant
    abondservant Member Posts: 4,796 ✭✭✭

    L2 lvl4 (...) WORDsearch, all the way through L10,

  • Bob Pritchett
    Bob Pritchett Member, Logos Employee Posts: 2,280

    Thank you for the kind words and prayers of support and encouragement.

    (For example, I'm told that the storage system was back up relatively soon, but despite reporting health simply wasn't performing as it should. We don't yet know why.)

    It turns out that we discovered a significant bug in the storage framework; that's what has led an embarrassing (but theoretically straightforwardly recoverable) hardware component failure to turn into a continuing nightmare: the system 'locked down' write access to the servers to protect data integrity even after the failed hardware was replaced, and we've been unable to get it out of this lockdown. (The services that are back up are on new hardware / systems, etc.)

    Jim Straatman has posted more technical details at https://community.logos.com/forums/t/96630.aspx and I'm confident he and the team will be able to get it back up with the help of Inktank's engineers, now that the bug has been identified.

    Email continues to be down -- giving me a strange 'how can I still be breathing? yet I am!' feeling -- so if you need to reach me feel free to use bobpritchett@gmail.com or a direct tweet @BobPritchett. (We have a five day inbound email buffer through our outside spam screener, so emails you've sent while we're down are queued up and will deliver once service is restored.)

  • Lee
    Lee Member Posts: 2,714 ✭✭✭

    Bugs, only entomologists like them.

    I hope you have a service contract that covers economic loss or some form of compensation!

  • Jacques
    Jacques Member Posts: 30 ✭✭

    I am very sorry for the inconvenience to all of you.

    -- Bob

    Please accept our apology also.  When we get frustrated and start throwing our toys around and make harsh comments we tend to forget we are part of the Faithlife community. We all enjoy the good times, and sometimes, struggle to patiently endure the bad times - which is part and parcel of this life.

    Thanks to you and the team for the sacrifices made in this time.

  • John O'Malley
    John O'Malley Member Posts: 24 ✭✭

    As a one week old Logos user – there is grace for you and your team. 

  • MJ. Smith
    MJ. Smith MVP Posts: 54,977

    Welcome John ... your spelling and typing is phenomenal for a week old infant ... [;)] Yes, I know what you really meant. It's clearly time for me to wrap up the forum reading for the night. Nice to end on a post that is gracious.

    Orthodox Bishop Alfeyev: "To be a theologian means to have experience of a personal encounter with God through prayer and worship."; Orthodox proverb: "We know where the Church is, we do not know where it is not."

  • Jack Caviness
    Jack Caviness MVP Posts: 13,608

    Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down. 

    You realize, of course, that this disqualifies you from ever entering American politics. No politician ever takes responsibility for his own failures, much less for those under his supervision [:D]

    Thank you and your team for the dedication to customer service and for the hard work putting things back to normal. Also thank you for the explanation(s) of what happened.

  • James Hiddle
    James Hiddle Member Posts: 792 ✭✭

    Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down. 

    You realize, of course, that this disqualifies you from ever entering American politics. No politician ever takes responsibility for his own failures, much less for those under his supervision [:D]

    Thank you and your team for the dedication to customer service and for the hard work putting things back to normal. Also thank you for the explanation(s) of what happened.

    Yeah show me a politician that takes responsibility for his mistakes and I'll show you a hobbit riding a unicorn while singing karaoke [:D] 

  • Erwin Stull, Sr.
    Erwin Stull, Sr. Member Posts: 2,793 ✭✭✭

    Lee said:

    Bugs, only entomologists like them.

    I hope you have a service contract that covers economic loss or some form of compensation!

    Many companies have Loss Insurance Policies in place. The thing is whether they want to use it or not. Using it more than likely means increased rates for life.

  • Lee
    Lee Member Posts: 2,714 ✭✭✭

    Lee said:

    Bugs, only entomologists like them.

    I hope you have a service contract that covers economic loss or some form of compensation!

    Many companies have Loss Insurance Policies in place. The thing is whether they want to use it or not. Using it more than likely means increased rates for life.

    Unfortunately, it sounds like the company did not have a comprehensive service contract for the fault in question.

  • Erwin Stull, Sr.
    Erwin Stull, Sr. Member Posts: 2,793 ✭✭✭

    Lee said:

    Lee said:

    Bugs, only entomologists like them.

    I hope you have a service contract that covers economic loss or some form of compensation!

    Many companies have Loss Insurance Policies in place. The thing is whether they want to use it or not. Using it more than likely means increased rates for life.

    Unfortunately, it sounds like the company did not have a comprehensive service contract for the fault in question.

    That is unfortunate, which would only leave the options of paying out of pocket for the additional equipment and engineering, or to use the insurance policy (I'm pretty sure Logos has one). Both options can be very expensive.