After being down for 11 hours Monday, Foursquare posted a lengthy, technical post-mortem explaining the outage, the company's embarrassment and how it planned to prevent future breakdowns. A few hours after this mea culpa, the site was down again.
Foursquare was back up shortly after midnight, the second day in a row the startup's small team of engineers has worked the graveyard shift, powered by Red Bull and a flood of angry tweets from users.
The simple explanation for the problems is that Foursquare stores its data, like user check-ins, across a distributed network of what it calls “shards”. On Monday, one shard started to receive a disproportionate number of check-ins and became unstable. To help, Foursquare brought a new shard online, but this ended up flummoxing the system. To avoid losing any user data, like those precious mayorships, Foursquare shut the whole site down.
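For readers who want to see how one shard can end up carrying far more than its share, here is a minimal Python sketch. It is not Foursquare's code. The range-based placement rule, the function names (`shard_for`, `shard_loads`, `overloaded`), the 1.5x tolerance and the traffic numbers are all assumptions made up for illustration.

```python
from collections import Counter

def shard_for(user_id, num_shards, users_per_shard):
    """Hypothetical placement rule: consecutive user IDs share a shard."""
    return min(user_id // users_per_shard, num_shards - 1)

def shard_loads(checkins, num_shards, users_per_shard):
    """Count how many check-ins land on each shard."""
    loads = Counter({shard: 0 for shard in range(num_shards)})
    for user_id in checkins:
        loads[shard_for(user_id, num_shards, users_per_shard)] += 1
    return loads

def overloaded(loads, tolerance=1.5):
    """Return shards holding more than `tolerance` times the average load."""
    mean = sum(loads.values()) / len(loads)
    return sorted(shard for shard, count in loads.items() if count > tolerance * mean)

# Made-up traffic: users 0-99 check in nine times each, users 100-199 once each.
checkins = [uid for uid in range(100) for _ in range(9)] + list(range(100, 200))
loads = shard_loads(checkins, num_shards=2, users_per_shard=100)
hot_shards = overloaded(loads)
```

In this toy example the first shard takes 900 of the 1,000 check-ins, so it is the only one flagged. Both shards hold the same number of users, so the imbalance comes entirely from how active those users are, which is the kind of skew the paragraph above describes.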
The silver lining of the second outage in as many days was that the new communication strategy Foursquare put in place was a big help. The Twitter account @4sqSupport kept up a constant stream of updates and responded directly to questions from frustrated users and third-party developers.
Good thing Twitter is so reliable.