Data Center Storage Failure, 11/21


Posts 80
LogosEmployee
Jim Straatman | Forum Activity | Posted: Sat, Nov 22 2014 8:13 PM

A storage system that backs many core websites and services started failing at approximately 4:30 PM PST on Friday. The components that failed (hard drives & a RAID controller battery) have been replaced. However, Ceph, the distributed object storage technology we use to run databases, application servers, and file storage, requires a re-balance to heal. This healing process takes many hours, potentially up to 20, to distribute ~190 TB of data across 5 storage nodes (computers). The re-balance started around 1 PM PST, and once it completes, all systems will be back to normal.
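As a rough sanity check on those figures, the quoted worst case implies a sustained transfer rate that can be worked out in a few lines. This is only a sketch: it assumes the entire ~190 TB is moved, that 1 TB means 10^12 bytes, and that the rate is constant. None of those assumptions are stated above, and the function name `rebalance_rate` is made up for illustration.

```python
# Back-of-the-envelope rebalance throughput from the figures in the post
# (~190 TB, up to 20 hours, 5 storage nodes).
# Assumptions (NOT from the post): all 190 TB is actually moved,
# 1 TB = 10**12 bytes, and the transfer rate is constant.

def rebalance_rate(total_tb: float, hours: float, nodes: int) -> dict:
    """Return the average rates implied by moving total_tb in the given hours."""
    total_bytes = total_tb * 10**12
    seconds = hours * 3600
    aggregate_gb_s = total_bytes / seconds / 10**9  # GB per second, all nodes
    return {
        "tb_per_hour": total_tb / hours,
        "aggregate_gb_per_s": aggregate_gb_s,
        "per_node_gb_per_s": aggregate_gb_s / nodes,  # even split across nodes
    }

estimate = rebalance_rate(total_tb=190, hours=20, nodes=5)
for name, value in estimate.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions the cluster would need to average roughly 9.5 TB per hour, which helps explain why the heal is measured in hours rather than minutes. A real rebalance typically moves only part of the stored data, so the actual rate may well be lower.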

We’re currently migrating core services off the failed storage system to expedite system availability. Commerce websites including Logos.com, Verbum.com, Vyrso.com, Noet.com, etc. should migrate within the next hour or two.

We’re taking the same steps with Proclaim to get it functioning in time for Sunday morning service.

Ceph and other central storage technologies are commonplace in data center deployments. They are essentially many-eggs-in-one-basket architectures that are vulnerable to large-scale failures. To reduce the impact of hardware failures, we’re researching decentralized storage options that minimize the dependence of core systems on any one component.

We’re also building a redundant data center deployment in south Seattle. This will enable rapid failover when a system-wide event occurs. Unfortunately, that second data center is several weeks/months from completion.

We’re very sorry for the inconvenience caused by this event, and appreciate your continued patience and support as we work toward a resolution. Know that we appreciate your business and are making every effort to improve system availability.

Posts 2764
Erwin Stull, Sr. | Forum Activity | Replied: Sat, Nov 22 2014 8:21 PM

Jim Straatman:

We’re also building a redundant data center deployment in south Seattle. This will enable rapid failover when a system-wide event occurs. Unfortunately, that second data center is several weeks/months from completion.

This was going to be my question.

I know building a redundant datacenter can be very costly, but considering the growth and dependency, it may be time to do this.

Posts 787
James Hiddle | Forum Activity | Replied: Sat, Nov 22 2014 8:44 PM

Jim Straatman:

A storage system that backs many core websites and services started failing at approximately 4:30 PM PST on Friday. The components that failed (hard drives & a RAID controller battery) have been replaced. However, Ceph, the distributed object storage technology we use to run databases, application servers, and file storage, requires a re-balance to heal. This healing process takes many hours, potentially up to 20, to distribute ~190 TB of data across 5 storage nodes (computers). The re-balance started around 1 PM PST, and once it completes, all systems will be back to normal.

We’re currently migrating core services off the failed storage system to expedite system availability. Commerce websites including Logos.com, Verbum.com, Vyrso.com, Noet.com, etc. should migrate within the next hour or two.

We’re taking the same steps with Proclaim to get it functioning in time for Sunday morning service.

Ceph and other central storage technologies are commonplace in data center deployments. They are essentially many-eggs-in-one-basket architectures that are vulnerable to large-scale failures. To reduce the impact of hardware failures, we’re researching decentralized storage options that minimize the dependence of core systems on any one component.

We’re also building a redundant data center deployment in south Seattle. This will enable rapid failover when a system-wide event occurs. Unfortunately, that second data center is several weeks/months from completion.

We’re very sorry for the inconvenience caused by this event, and appreciate your continued patience and support as we work toward a resolution. Know that we appreciate your business and are making every effort to improve system availability.

OK, that went way over my head. It's too technical for my taste, but from what I read, Logos is down and will be back up when you fix the problem.

Which would be nice!

Posts 281
Sean McIntyre | Forum Activity | Replied: Sat, Nov 22 2014 8:52 PM

Jim Straatman:
We’re taking the same steps with Proclaim to get it functioning in time for Sunday morning service.

It is Sunday morning in Kenya. I don't appreciate being a second-class citizen to my American colleagues.

Jim Straatman:
Know that we appreciate your business

Your actions do not say so. The last time that there was an issue and the forums were down, I complained that you did not communicate the issue via social media or respond to my enquiries via the same. Your handling of this issue shows that my complaint was not taken seriously and that you have no protocols in place for communicating with your customers during such an event as this. Don't remind me that you responded via Twitter, because that response was not to my tweet but to the latest ones, and it came after many hours had passed.

Recently I purchased L6 and at no point was I given any clue that there was an issue with Yosemite. 

I do not feel valued as a customer and talk is cheap. However, since you did nothing to change last time, I guess that nothing will change this time.

Posts 787
James Hiddle | Forum Activity | Replied: Sat, Nov 22 2014 9:00 PM

What caused the failure to begin with?

Posts 80
LogosEmployee
Jim Straatman | Forum Activity | Replied: Sat, Nov 22 2014 9:25 PM

Sean,

My apologies for my comments regarding Sunday morning and our timezone difference.

You are right: we should communicate better when things like this happen. It can be difficult to broadcast information in all the appropriate channels, especially when our attention is focused on resolving the problem. We're currently building a status dashboard to expose key performance metrics and information when an event like this occurs, similar to what you might find at Amazon Web Services or GitHub.

Posts 26532
Forum MVP
MJ. Smith | Forum Activity | Replied: Sat, Nov 22 2014 9:38 PM

James Hiddle:

What caused the failure to begin with?

The hard drives (and a battery) failed just as they do on your home computer or laptop - wear, heat, manufacturing flaw ... all the things you think of at home.

Orthodox Bishop Hilarion Alfeyev: "To be a theologian means to have experience of a personal encounter with God through prayer and worship."

Posts 281
Sean McIntyre | Forum Activity | Replied: Sat, Nov 22 2014 9:44 PM

Jim Straatman:
My apologies for my comments regarding Sunday morning and our timezone difference.

Thanks. I appreciate that you are taking steps to improve and a status page would go a long way towards reassuring me in the future. I assume that it would be insulated somehow from everything else. One frustration for me was that the forums went down leaving me completely in the dark. 

Posts 80
LogosEmployee
Jim Straatman | Forum Activity | Replied: Sat, Nov 22 2014 9:52 PM

Yes, Sean, we'll host it somewhere completely separate from our production infrastructure. 

Posts 2
John McGovern | Forum Activity | Replied: Sat, Nov 22 2014 11:11 PM

God bless you Jim and the IT team at Logos.  I know these events can be physically and emotionally exhausting.  Hang in there!

Posts 80
LogosEmployee
Jim Straatman | Forum Activity | Replied: Sat, Nov 22 2014 11:52 PM

All Proclaim databases have migrated to a more stable deployment and are back online. We're currently adding web servers to accommodate peak load.

Posts 248
Patrick Rietveld | Forum Activity | Replied: Sun, Nov 23 2014 1:00 AM

John McGovern:

God bless you Jim and the IT team at Logos.  I know these events can be physically and emotionally exhausting.  Hang in there!

Amen.

Posts 248
Patrick Rietveld | Forum Activity | Replied: Sun, Nov 23 2014 1:21 AM

MJ. Smith:

... all the things you think of at home.

Yes, you put things in a better perspective. Thanks!

I think the thing that makes us easily frustrated/angry is that we can't control these things anymore. We are so dependent on others that we get frustrated when something doesn't work as we want it to.

I just came back from a work visit to Tanzania. Every time I am in Africa I realise that our ability to control our lives is much more limited than we think. Much more than just web servers. Turning back to stone tablets won't do the trick. We (or maybe it is just me) take much more for granted than we realize.

Suppose the power fails today? Suppose we can't shower in clean drinking water anymore (or drink it)? What to do if the candles run out? What to do when the gas bottle is empty and the shops don't have any (until nobody knows when)? What to do when there is no well-working public service for rubbish (like printer toners, toxic fluorescent bulbs, polypropylene bags, shaving foam cans)? Just throw it on the street? Just burn it? Just bury it?

What to do if the bread that is sold is just another form of powder? What to do if the bus to the next city three hours down the road doesn't show up? What to do if it shows up but breaks down after an hour and a half?

I find it scary that I can't do much myself. I am scared that I depend so much on others. Frustrated, since so many times 'the others don't do as I like'. Scared, since people would think the same about me.

But it is also a miracle that many times it works! A miracle and a blessing: we don't need to do everything ourselves. We have each other. So let us enjoy and celebrate that.

This is the WORLD the Lord has made. We will rejoice and be glad in it.

Posts 80
Rick J. | Forum Activity | Replied: Sun, Nov 23 2014 1:26 AM

John McGovern:

God bless you Jim and the IT team at Logos.  I know these events can be physically and emotionally exhausting.  Hang in there!

Amen

Posts 80
LogosEmployee
Jim Straatman | Forum Activity | Replied: Sun, Nov 23 2014 2:18 AM

Community APIs that support Faithlife.com & groups functionality within various applications have been restored. We're currently working on alternative infrastructure for the commerce engine to get Logos.com, Verbum.com, Noet.com, & Vyrso.com online.

The storage rebalance is about 50% complete. We'll continue to migrate core services off the failed storage system while it's recovering.

Posts 78
Rob | Forum Activity | Replied: Sun, Nov 23 2014 2:54 AM

Jim Straatman:

We’re also building a redundant data center deployment in south Seattle. This will enable rapid failover when a system-wide event occurs. Unfortunately, that second data center is several weeks/months from completion.

Jim, now is the time for some serious reflection at Faithlife. As someone alluded to elsewhere, your systems are expected to be up 24/7. Faithlife is now a global service and anything less than 24/7 is out of the question.

In your situation I wouldn't be able to sleep at night knowing that you have a single data center in Seattle. As far as I'm concerned, a second data center in Seattle is not a solution. Does the earth not shake in Seattle? How about a tsunami? You need to be looking at the East or Midwest.

Personally, I work in a small company where it is my responsibility to keep a website and company infrastructure up 24/7. Our customers may only be numbered in the thousands, but they are all over the globe. I have found, as well as it being my own expectation, that people who have been sold a product or service expect it to be available to them when they want it and when they need it. The significance of any outage goes way beyond the financial. It affects our reputation and how the customer sees our dependability.

I know that you guys work hard and I know how overwhelming the technology can be. I just wanted to voice my concern.

Posts 2
Ralph Tyner | Forum Activity | Replied: Sun, Nov 23 2014 3:06 AM

I am thankful that I am able to use Logos 6 on my Mac and that resources that were downloaded prior to the current problems appear to be accessible and useable. Once the server is back to full capacity, I wonder if you might also address a question I have about the Logos mobile app (or refer it to the appropriate IT person). I had downloaded a number of resources to both my iPad and iPhone for offline use and although the apps now show me as signed in to my account at logos, none of the resources appear in my library on either device.  I know that there are a lot of people who rely on the software more than I do, but I'm guessing that others have had similar problems with their mobile Logos. Thanks so much for your hard work. Best wishes for continued progress.

Posts 1281
toughski | Forum Activity | Replied: Sun, Nov 23 2014 3:23 AM

Jim Straatman:
A storage system that backs many core websites and services started failing at approximately 4:30 PM PST on Friday.

Logos.com and Vyrso.com are still down at 5:22 CST on Sunday, Nov 23rd

Posts 2061
GaoLu | Forum Activity | Replied: Sun, Nov 23 2014 3:46 AM

Think of all the free books we will get when it comes back up!

Posts 5
Fred | Forum Activity | Replied: Sun, Nov 23 2014 3:52 AM

Ralph Tyner:

I am thankful that I am able to use Logos 6 on my Mac and that resources that were downloaded prior to the current problems appear to be accessible and useable. Once the server is back to full capacity, I wonder if you might also address a question I have about the Logos mobile app (or refer it to the appropriate IT person). I had downloaded a number of resources to both my iPad and iPhone for offline use and although the apps now show me as signed in to my account at logos, none of the resources appear in my library on either device.  I know that there are a lot of people who rely on the software more than I do, but I'm guessing that others have had similar problems with their mobile Logos. Thanks so much for your hard work. Best wishes for continued progress.

This was my experience exactly. Resources I download for offline use should be available when I want them. A dependency on network availability and availability of your servers (as certainly appears to be the case) should not exist for this use case. Kindle & iBooks work fine without connectivity. I simply love Noet but my confidence has been badly shaken. Thanks, and good luck with the continuing mop up.

Fred
