Between 8:20 a.m. and 9 a.m., Twitter completely crashed, leaving many users devastated and distraught. We desperately cast about for answers in the bittersweet shelter of other social networks, begging our Facebook friends and our Tumblr followers for an answer to the long-lasting outage. An error page brought no relief: “Whole server runtime (in this case Ruby engine) is down and web server send raw code to client browsers,” a helpful commenter attempted to clarify, but we still had no idea what <%= reason %> actually meant. It was a rough 40 minutes.
The service came back up a few hours ago, but Twitter didn’t explain itself until now. So what happened? Was it a hacker attack? Olympic overload?
Turns out it was a total data center fail. That doesn’t have quite the same ring to it as “cascading bug,” but we’ll deal.
According to Twitter’s VP of engineering, Mazen Rawashdeh:
“The cause of today’s outage came from within our data centers. Data centers are designed to be redundant: when one system fails (as everything does at one time or another), a parallel system takes over. What was noteworthy about today’s outage was the coincidental failure of two parallel systems at nearly the same time.
“I wish I could say that today’s outage could be explained by the Olympics or even a cascading bug. Instead, it was due to this infrastructural double-whammy. We are investing aggressively in our systems to avoid this situation in the future.”
Seems like a totally understandable freak coincidence. Perhaps you’d like to retract your “Fuck You,” Debby in Germany?