Downtime April 13th 2014

GHDpro

Administrator
Staff member
The site has just recovered from the longest downtime I can remember (in recent history anyway).

The problem started about 12 hours ago when Linode.com's Newark facility lost power, which took their servers down. The Emulator Zone is hosted on a VPS at Linode.com in that particular facility.

After about 3-4 hours they finally fixed the problem and all servers started coming back online. Once the site was back up, I quickly logged in to check whether the sudden power loss had caused any issues (with, for example, the database). I also noticed a bunch of server packages were out-of-date, so I updated them and, for good measure, soft-rebooted the server to make sure all updates were loaded.

Big mistake.

It turns out the physical node the site was hosted on was having issues that prevented it from loading the site's server image. It wouldn't load the "Rescue Mode" image either, so I quickly figured out I needed to open a ticket. Shortly after that the site was moved to another physical node, and I hoped the problem was solved.

Guess not.

The site's server image was still not loading, although the "Rescue Mode" image now was. So I started thinking that maybe the server image was damaged and wouldn't load, and I tried restoring a backup from before the outage. That didn't load either. Damn. About this time a notice appeared in Linode's control center that said something like "we are aware of an issue with this node, please do not open any ticket, just wait".

So I did.

After a few hours of waiting I got impatient and asked support for an update on how much longer it was going to take. It turned out that when the site was moved to another physical node, it was moved to one with the exact same issue as the node it had been moved from. Doh! So the support agent initiated another move to yet another server.

And 3rd time's a charm... the site is back!

TL;DR: a power outage caused a few hours of downtime, and miscommunication about what was keeping the site from coming back online caused a couple more hours of downtime that could have been avoided.

My apologies for the inconvenience.
 