Apology for Server Failure
I am very sorry for the inconvenience to all of you.
Our team has been working day and night to fix this, and I'm trying (as much as I am able!) to resist digging into the how? and why? too much until after we are through the crisis and have all the systems restored.
Be assured, though, that we will thoroughly examine this incident and take from it a lot of lessons and make the necessary changes.
We understand that our products have changed from 'needing to sync once in a while' to 'really needing to connect', and we are re-architecting both our infrastructure and our applications to reflect that, and to be more robust against failure. Some months ago we added complete offline backup/restore capability to Proclaim, so that it can run and even move presentations between systems without the Internet. Our desktop and mobile apps should work offline, except for online-only features (which Logos 6 introduced more of), and books that aren't downloaded.
(I've seen more reports than I expected of issues with the mobile app; the mobile apps should use any downloaded books just fine when there's no Internet -- I read in our mobile apps on the plane last week (to ETS and AAR/SBL conferences, where I am now) without any issues. But I know that iOS in particular can make its own decision to delete non-user-created data, like our downloaded ebooks, to manage space on the device, and when we get through this we'll look carefully at the source of mobile issues to see if this was the cause of unexpected missing books (and if so, how we can prevent it), or if there's something else going on.)
We have recently made dramatic changes in our back-end systems, particularly in the area of storage management, in anticipation of Logos 6. We needed both more online storage (for increased user count, the huge databases of map titles for Atlas, the media-related features, etc.) and increased reliability, in light of the 'Internet-only' nature of many new features.
Our team has long desired to have a completely mirrored back-end infrastructure, in which every server and every database is live-mirrored on a hot-switchable system. This, of course, requires essentially a 100% increase in systems cost, which runs into the hundreds of thousands of dollars. And it's still not a perfect system, because while it protects against hardware failures, it exactly replicates data/software problems to the secondary machine.
For a long time I resisted this, because the cost was extraordinary and we had high confidence that any failure could be addressed in just a few hours with replacement hardware (kept on hand) and/or restoring from backups. We felt the risk of a few hours vs. the ongoing doubling of (already expensive) systems-cost wasn't worth it.
As our model has changed, though, we've realized that we need to be able to guarantee up-time and reliability, as many of you have correctly pointed out in forum posts this weekend. We also realized that our existing platform was prohibitively expensive to scale up to our future storage needs, and that our existing data center in Bellingham (professionally outfitted, but small by today's data center standards) wasn't enough to be our only solution.
So we designed a new architecture based on the latest trends in data center management, and using some new (to us) platforms. We also arranged for a secondary location south of Seattle (100 miles away from Bellingham) where we could run systems that we could quickly switch to. The secondary data center was set up, provisioned with hardware, and planned to be ready for the Logos 6 launch.
(Yes, we did discuss the idea of having the second data center even further away; a lot of discussion went into hosting it in Phoenix, near our Tempe office, so that any northwest catastrophe couldn't wipe out both centers. But there was some advantage to being able to have the team visit the second location in person at times, and it was hard to imagine a scenario in which Tukwila and Bellingham were both catastrophically destroyed and unrecoverable for a long period of time.)
All that remained was the high-bandwidth data link which would allow these two data centers to be completely and directly peered, without the same capacity constraints that were already an issue with being in the smaller Bellingham data center. We've been waiting for weeks for this data link to go live (it needs to be connected by a third party), and this weekend's failure happened before it did.
In theory, even without the secondary data center we should have been able to recover in hours. We employ technologies that should report drive and component failures, and which support hot-swapping of components, spares of which we keep on hand.
The report I have is that the tiered configuration of clusters -- clusters of drives represented as one drive to a node, clusters of nodes reported as one system to a higher-level system, etc. -- obscured a low-level error. A component at the bottom failed, and then two others, but the small failures were somehow obscured by the redundancy built into the system -- at the higher level the whole system was fine when the first small piece failed, and it didn't report the low-level failure. (This may be a mistake in our architecture, or in our configuration of the reporting software.) Then other components failed before we knew to replace the first failed component, and 'critical mass' was achieved.
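The masking effect described above can be illustrated with a minimal sketch. This is not our storage stack or its reporting software; the Component class, the tolerated_failures field, and the report function are hypothetical names invented for this example. It only shows how a tiered system can look "up" at the top while a leaf has already failed, and how also collecting leaf failures would surface a "DEGRADED" state before redundancy runs out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Component:
    """A node in a hypothetical redundancy tree (drive, drive cluster, system)."""
    name: str
    healthy: bool = True            # only meaningful for leaf components
    tolerated_failures: int = 0     # failed children this level can absorb
    children: List["Component"] = field(default_factory=list)

    def is_up(self) -> bool:
        """Top-down view: 'up' while failed children stay within tolerance."""
        if not self.children:
            return self.healthy
        failed = sum(1 for child in self.children if not child.is_up())
        return failed <= self.tolerated_failures

    def failed_leaves(self) -> List[str]:
        """Every failed leaf below this component, even if redundancy hides it."""
        if not self.children:
            return [] if self.healthy else [self.name]
        names: List[str] = []
        for child in self.children:
            names.extend(child.failed_leaves())
        return names


def report(root: Component) -> Tuple[str, List[str]]:
    """Return a status that does not let redundancy hide low-level failures."""
    failed = root.failed_leaves()
    if not root.is_up():
        return "DOWN", failed
    if failed:
        return "DEGRADED", failed   # still serving, but a part needs replacing
    return "OK", failed


def build_example() -> Tuple[Component, List[Component]]:
    """Four drives in one cluster that tolerates two failures (assumed numbers)."""
    drives = [Component(f"drive-{i}") for i in range(4)]
    cluster = Component("drive-cluster", tolerated_failures=2, children=drives)
    system = Component("storage-system", children=[cluster])
    return system, drives
```

With one failed drive, is_up() on the top-level system still returns True, which is the "whole system was fine" view; report() returns "DEGRADED" along with the name of the failed drive, which is the signal that would prompt a replacement before further failures accumulate.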
This weekend the goal has been to get back up and running. Our team worked very hard to ensure that Proclaim was live for the majority of users. The fact that the Tukwila data center was only 100 miles away meant we were able to drive there and fetch the redundant equipment on a Saturday and start deploying new hardware quickly. Another team member drove to a Seattle computer store and bought every hard drive in stock so we could build new storage arrays. The team has been going on very little sleep and working hard to restore everything. I'm told there's no data loss -- just a deployment problem.
(There are some technical details I'm still fuzzy on here, but apparently another issue relates to the number and size of data objects we have, and the fact that it takes longer than we anticipated to move the huge amount of data we now manage around when you swap new storage components into the system.)
I'm providing a very high-level overview of the problem, and I am simplifying the description and providing it as I understand it right now. I could be describing it wrong, and once we know more about what happened we'll post a better explanation. (For example, I'm told that the storage system was back up relatively soon, but despite reporting health simply wasn't performing as it should. We don't yet know why.)
Ultimately the fault is mine.
Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down.
I am so sorry for inconveniencing so many of you this weekend, and I promise you that we will try to live out our company values (http://www.slideshare.net/BobPritchett/faithlife-corporate-culture-40614282) and turn these mistakes into learning experiences and return to delivering an Awesome experience and delighting our customers.
-- Bob
Comments
Thanks Bob, for the report.
I knew that you were going to beat yourself up on this (being the CEO), but I want to encourage you to not do so (that's probably going in one ear and out the other).
I believe that the majority of us Logos users understand that these things happen, and are sometimes catastrophic; however, there is a recovery, and you are moving in that direction.
I, for one, will remain a Logos supporter.
Be blessed.
Bob, thank you for your transparency regarding this significant issue. I appreciate that you "own" this and am confident that you and your team will do what it takes to learn from this experience and continue to make future products even better. May the Lord give you the wisdom and strength you need for the days ahead.
Using adventure and community to challenge young people to continually say "yes" to God
Thanks for the update. I work in IT myself, and I know full well the potential system and hardware failures that can happen. We do the best we can, and there's always something we can do better.
Our church is poor (space rented from another church, no internet, etc.), so we don't run Proclaim (just Powerpoint from my own laptop). I always made sure everything can be run offline; I use a tablet for studies and reading, but I don't preach from it.
Look forward to everything being back online again. Keep up the good work. My prayers are with you and all the IT people hard at work getting things back to normal.
David
We're blessed by your honest and hands-on management. It's clearly seen by the dedication of you and your staff.
Thank you so much!
Blessings
The mind of man is the mill of God, not to grind chaff, but wheat. Thomas Manton | Study hard, for the well is deep, and our brains are shallow. Richard Baxter
The last seven years of my career with the Federal Aviation Administration I worked with automation systems in air traffic control. As important as air traffic control is we had systems failures there just like anywhere else. What we did was try and learn from it, did our very best to make sure it didn't happen again, and moved on.
Thanks for being willing to take responsibility for this Bob but please don't beat yourself up too much over this. It is something to learn from and show you ways to make improvements to your system. Yes some people were inconvenienced but it is just something that happens with any technology company from time to time whether we are talking about Microsoft, Apple, or any other company.
Thanks for being willing to come on the forums like you do, be upfront with your customers, and for building great Bible software!
Thank you Bob for the explanation. As I said elsewhere, this was an inconvenience, not a disaster. Fortunately for me, other than a delay in getting a free book, Verbum 6 functioned perfectly (as I did not wish to make a visual copy or use the atlas, this blip could have gone unnoticed by me had I not been a user of the forums). I am glad you are getting things back up to 100% and going ahead with the strategy that should make this event a one-time occurrence. I know this incident just happened a few weeks too soon; Murphy's law works that way. I remember over a decade ago, I had my Mac backed up except my iTunes. I had bought the CD-ROMs to back them up, and even reorganized my music into 750 MB folders. The day before I was going to do it, my hard drive died. It was an extreme pain needing to redo all my MP3s, but I also made sure I got a bigger backup hard drive so I always have a few backups of my machine. God bless you and all your employees.
-Dan
Thanks for the update. I work in IT myself, and I know full well the potential system and hardware failures that can happen. We do the best we can, and there's always something we can do better. ..... Look forward to everything being back online again. Keep up the good work. My prayers are with you and all the IT people hard at work getting things back to normal.
Similar thoughts, and I'll remain a customer
Gold package, and original language material and ancient text material, SIL and UBS books, discourse Hebrew OT and Greek NT. PC with Windows 11
Thanks for the info. I have been using Logos since Libronix days. I do not remember any type of problem like this before. You all have outstanding customer service!! I would like to see people walk in love!! Does Christ condemn? I think not!! Thanks for all of your and your amazing staff's hard work!! God Bless!![Y][Y]
So we designed a new architecture based on the latest trends in data center management, and using some new (to us) platforms. We also arranged for a secondary location south of Seattle ...
All that remained was the high-bandwidth data link which would allow these two data centers to be completely and directly peered, without the same capacity constraints that were already an issue with being in the smaller Bellingham data center. We've been waiting for weeks for this data link to go live (it needs to be connected by a third party), and this weekend's failure happened before it did.
I don't think you need to peer everything in real-time. There may be cost-saving alternatives.
If only life were perfect...
I was a qualified reactor operator for the US Navy for many years. Our systems were robust and redundant. They were supposed to tell us what was happening...yet the systems were also extremely complex. A simple failure could often lead to complete shutdowns...that were extremely difficult to diagnose and correct.
In other words, life isn't perfect, and when you deal with large, complex systems, it will often bite you. Thanks for all of the hard work to get it back up and running as quickly as possible.
Hello Mr. Pritchett:
99% of what you said above is so far over my head, but truth be told, the most important words I saw were "I apologize..." Things happen, and the blessing in disguise here is that you never would have learned all you learned if it was not for what you just went through. Isn't that the purpose of a trial?
"Consider it JOY my brethren when you encounter various trials, knowing that the testing of your faith produces endurance..."
I too was praying for all the workers of Faithlife and am happy to see you back online. Get some sleep now, and ask the "why" questions TOMORROW!!!
Blessings,
Cynthia Feenstra
Hello Mr. Pritchett:
99% of what you said above is so far over my head, but truth be told, the most important words I saw were "I apologize..." Things happen, and the blessing in disguise here is that you never would have learned all you learned if it was not for what you just went through. Isn't that the purpose of a trial?
"Consider it JOY my brethren when you encounter various trials, knowing that the testing of your faith produces endurance..."
I too was praying for all the workers of Faithlife and am happy to see you back online. Get some sleep now, and ask the "why" questions TOMORROW!!!
Blessings,
Cynthia Feenstra
Somehow I got signed in under a different email, one I don't even use here (but did, I guess, at the beginning of my account). Hmmm...anyway, I am re-posting this so you know who I am.
Cynthia
Romans 8:28-38
I am very sorry for the inconvenience to all of you.
That is much appreciated, together with the level of detail.
Absolutely, and thank you Bob for the transparency and honesty in your reply. Nearly a year after one of the biggest banks in the UK had a massive failure that took out all its cash dispensers and payment systems, we still haven't had an explanation as clear and open as this one.
Running Logos 6 Platinum and Logos Now on Surface Pro 4, 8 GB RAM, 256GB SSD, i5
I am so sorry for inconveniencing so many of you this weekend, and I promise you that we will try to live out our company values (http://www.slideshare.net/BobPritchett/faithlife-corporate-culture-40614282) and turn these mistakes into learning experiences and return to delivering an Awesome experience and delighting our customers.
I think you are doing an outstanding job! When you are depending on technology things happen sometimes. I totally understand. I wish everyone else did. Keep up the great work! No complaints here about anything.
This is one of the reasons I love Logos!!! For the record, I've never received an apology from any other CEO...
[Y]
It is very, very rare (in the world) that you will see a CEO take personal ownership of an issue, and make a public apology, unless there is absolutely no choice (it is seen as a sign of weakness). Logos/Bob can place on paper all he/they want that Logos is strictly a business and nothing more, but I continue to see (and appreciate) a ministry/business. His ways (as seen) of conducting business, and his mannerism shows a very different picture than just business only.
I am with Bob and Logos regardless of pretty much what negative things may occur.
This is one of the reasons I love Logos!!! For the record, I've never received an apology from any other CEO...
It is very, very rare (in the world) that you will see a CEO take personal ownership of an issue, and make a public apology, unless there is absolutely no choice (it is seen as a sign of weakness). Logos/Bob can place on paper all he/they want that Logos is strictly a business and nothing more, but I continue to see (and appreciate) a ministry/business. His ways (as seen) of conducting business, and his mannerism shows a very different picture than just business only.
I am with Bob and Logos regardless of pretty much what negative things may occur.
Agreed! In this corporate world you hardly ever see a CEO taking personal responsibility for his company when it comes to issues like this, or coming to an open forum to apologize for something that wasn't even his fault.
Bob humbled himself in front of us and we should be thankful for that. God bless a CEO for doing that and God bless Bob always.
Thanks Bob! I hope you guys are able to find a solution that works without having to spend a fortune.
Disclaimer: I hate using messaging, texting, and email for real communication. If anything that I type to you seems like anything other than humble and respectful, then I have not done a good job typing my thoughts.
Thank you for the kind words and prayers of support and encouragement.
(For example, I'm told that the storage system was back up relatively soon, but despite reporting health simply wasn't performing as it should. We don't yet know why.)
It turns out that we discovered a significant bug in the storage framework; that's what has led an embarrassing (but theoretically straightforwardly recoverable) hardware component failure to turn into a continuing nightmare: the system 'locked down' write access to the servers to protect data integrity even after the failed hardware was replaced, and we've been unable to get it out of this lockdown. (The services that are back up are on new hardware / systems, etc.)
Jim Straatman has posted more technical details at https://community.logos.com/forums/t/96630.aspx and I'm confident he and the team will be able to get it back up with the help of Inktank's engineers, now that the bug has been identified.
Email continues to be down -- giving me a strange 'how can I still be breathing? yet I am!' feeling -- so if you need to reach me feel free to use bobpritchett@gmail.com or a direct tweet @BobPritchett. (We have a five day inbound email buffer through our outside spam screener, so emails you've sent while we're down are queued up and will deliver once service is restored.)
Bugs, only entomologists like them.
I hope you have a service contract that covers economic loss or some form of compensation!
Many companies have Loss Insurance Policies in place. The thing is whether they want to use it or not. Using it more than likely means increased rates for life.
Unfortunately, it sounds that the company did not have a comprehensive service contract for the fault in question.
I am very sorry for the inconvenience to all of you.
-- Bob
Please accept our apology also. When we get frustrated and start throwing our toys around and make harsh comments we tend to forget we are part of the Faithlife community. We all enjoy the good times, and sometimes, struggle to patiently endure the bad times - which is part and parcel of this life.
Thanks to you and the team for the sacrifices made in this time.
Welcome John ... your spelling and typing is phenomenal for a week old infant ... [;)] Yes, I know what you really meant. It's clearly time for me to wrap up the forum reading for the night. Nice to end on a post that is gracious.
Orthodox Bishop Alfeyev: "To be a theologian means to have experience of a personal encounter with God through prayer and worship."; Orthodox proverb: "We know where the Church is, we do not know where it is not."
Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down.
You realize, of course, that this disqualifies you from ever entering American politics. No politician ever takes responsibility for his own failures, much less for those under his supervision [:D]
Thank you and your team for the dedication to customer service and for the hard work putting things back to normal. Also thank you for the explanation(s) of what happened.
Whatever the technical investigation reveals, I am ultimately responsible to deliver the products and services you rely on, and I let you down.
You realize, of course, that this disqualifies you from ever entering American politics. No politician ever takes responsibility for his own failures, much less for those under his supervision
Thank you and your team for the dedication to customer service and for the hard work putting things back to normal. Also thank you for the explanation(s) of what happened.
Yeah show me a politician that takes responsibility for his mistakes and I'll show you a hobbit riding a unicorn while singing karaoke [:D]
Bob,
Thank you so much for your continued work on Logos. Having been with Logos since the very first product I have been more than pleased with the quality of service and quality of software.
I trust you will soon find all the remedies needed to make this great product fully functional again.
Blessings,
Rick Schaffner
Thanks Bob for the report, and I wouldn't say you "let us down" at all. Your team has been working overtime to get things back online for us, and I was still able to access my critical books for school and my critical Bible study tools. I had to change a few study habits during the meantime, but nothing that wasn't simple enough to do, and I'm just thankful for all the hard work you've all put in. Keep up the great work! Hang in there!
Nathan Parker
Visit my blog at http://focusingonthemarkministries.com
Well, today my Internet connection is very slow, and it turned out that the biggest Internet operator in Finland has problems in the overseas connections. They found that an excavator had cut an optical cable to Sweden. This means that missing redundancy somewhere in the network can happen to anyone [:(]
Gold package, and original language material and ancient text material, SIL and UBS books, discourse Hebrew OT and Greek NT. PC with Windows 11
So basically we have been sold (and paid good money for) a cloud based system that didn't actually have proper cloud infrastructure. Interesting.
Having an international cloud business with two data centers only 100mi from each other is hardly robust. The US has had power blackouts for whole regions for multiple days. While it's definitely not common, it does happen. When this happens you are not just affecting a customer base in the US but around the world. If you are really going to sell this as an international cloud based system then you should have redundancy in other parts of the world. Based on this experience it seems like I'm paying for one thing and getting something else.
I do really like the Logos software. However, I am finding it hard to justify upgrading to Logos 6 or buying other books and commentaries when there isn't proper infrastructure. Which is really disappointing since I was looking forward to both the upgrade and adding a couple commentaries this Christmas.
The US has had power blackouts for whole regions for multiple days.
I understand your dissatisfaction but hyperbole does not help your point. Days??
It's not hyperbole. Where my cousins live in Connecticut they've had blackouts lasting up to a week, in the winter no less, twice in the past couple of years. Parts of Long Island were without power for a week or more after Hurricane Sandy. A major earthquake in the Pacific Northwest (the anticipated "Big One") could cause widespread power outages for days.
On the other hand, Faithlife's response to this data center failure was to redouble their efforts to set up redundancy in more than one geographic location. They were already on a path to do that before this failure happened, and I'm guessing they are working as hard as they can to have it finalized before something like this happens again. It would be nice to hear an update on that.
The US has had power blackouts for whole regions for multiple days.
I understand your dissatisfaction but hyperbole does not help your point. Days??
What is hyperbole about that? It is true!
Choose Truth Over Tribe | Become a Joyful Outsider!
Regarding mobile devices and server failure: my solution is a tablet running Windows 8.1 so I can have Logos 6 offline. In my case, an Asus Vivo Tab Note 8 with 64 GB. The available GB may not be enough for those of you with massive libraries. For the rest of us, this solution answers the frequently posted frustrations with functions absent from the iPad (and I suppose from the Android).
In 2004, we had a series of hurricanes hitting just weeks apart that left us without power for two separate two-week stretches. We lost power for two weeks, it came back up for 6 or 8 days, and then bam, back down again. In 1993, the blizzard of '93 (or where I lived in Florida - the no-name storm, and now called the storm of the century by many) came up almost out of nowhere with little warning, and did considerably more damage than anyone could have expected. Its effects were felt from Canada to South America, and its epicenter was about 3 miles north of where I lived at the time. We had six feet of water in our home, which was itself several feet off the ground. Coast guard boats traveled in our front yard, and over our dock. As we were swimming (!) to safety, we saw dolphins crossing the highway.
In NC where I live now, if it rains hard people forget how to drive and run over telephone poles, and all sorts of things. If it snows? forget it. Don't even leave the house; and don't plan on having power throughout the entirety of the snow fall either. 1/2 inch of snow and an entire state capital shuts down.
In Florida we were prepared for that - a couple generators, a couple weeks worth of food, and water. Propane to cook on, and so forth. The worst part was the heat.
There are plenty of examples like this people can point to as reasons for offsite backups. There are so many eventualities, though (a giant super-volcano erupts, killing millions and causing a second ice age; an asteroid strike causes an extinction-level event; and on and on and on), that they can't all be accounted for. What about a low-yield nuclear device being detonated over Colorado, from which the resulting EMP would leave America in the dark ages for quite some time (some estimates fall into the 10-20 year range)? I guess my point in all this is that only so much can be done to protect and prepare for the unlikely. Sure, Logos could have a server farm in the countryside of every major continent on the planet. Wonder how much extra that would make our books cost?
If we listen to the alarmists amongst us, life as we know it totters around the balance point, back and forth between extinction for mankind and our relative safety. With threats coming from every vector, and with such scary news selling better than peaceful news, at some point we have to accept that there are some eventualities that just aren't feasible to predict, plan, or prepare for, and roll with the punches when they come.
This minor though annoying event (why is this still one of the top posts?) from two weeks ago now (?) maybe three weeks(?) - probably could have been prevented. Bob apologized, had a new plan in the works (redundant data center), and in light of what they learned, they made some changes to their setup to reflect a better (if not implicitly best) practice.
If the worst part of our day was that we had to preach from memory, or paper notes - then praise God for His mercies. If you weren't hungry, dying, or watching someone you love go hungry or die then praise God for His mercies.
Kudos, Logos, for learning from this. Keep up the good work.
L2 lvl4 (...) WORDsearch, all the way through L10,
There is nothing alarmist about suggesting that a backup be sited more than 100 miles away, or possibly in another continent. One doesn't need to compare that suggestion with canards about the hungry and dying either. Why the smear and the pseudo-spiritual appeals to extreme?
Setting up a robust cloud structure is expensive and complicated. People have made billions doing this "simple" thing. Since F.L. has made cloud-based features a part of their product offering, they will have gone through the process of rationalization, cost-benefit analysis and planning, and will also take new circumstances to account.
I'm also sure that Bob or someone at F.L. will reveal future developments, in due time.
Praying for you guys all weekend Bob. And thank you for being willing to take the blame, that says a lot about you as a CEO and as a Person.
davidtaylorjr.com