(NEXSTAR/AP) – Facebook, as well as subsidiaries Instagram and WhatsApp, all suffered an outage Monday that was longer than four hours (and counting) at publication time. What caused the social media meltdown?
“We are experiencing networking issues and teams are working as fast as possible to debug and restore as fast as possible,” said Facebook Chief Technology Officer Mike Schroepfer on Twitter at 12:52 p.m. Pacific Time – three hours into the outage.
Facebook had not elaborated further as of 1:30 p.m. Pacific Time.
Doug Madory, director of internet analysis for Kentik Inc., said it appears that Facebook withdrew “authoritative DNS routes” that let the rest of the internet communicate with its properties.
Such routes are part of the internet’s Domain Name System, a key structure that determines where internet traffic needs to go. DNS translates a name like “facebook.com” into a numeric IP address, such as 192.0.2.1. If Facebook’s DNS records disappeared, apps and web browsers would be unable to locate its servers.
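For readers who want to see what that lookup step looks like in practice, here is a minimal Python sketch. It is an illustration only, not a description of Facebook’s systems: the function name `lookup` and its `resolver` parameter are made up for this example, and the code simply asks the operating system’s resolver for a name’s addresses and returns an empty list when no answer comes back, which is roughly the situation apps would face if the records could not be found.

```python
import socket


def lookup(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, unique IP addresses for hostname.

    Returns an empty list when the name cannot be resolved.
    `resolver` is a hypothetical hook so the function can be
    exercised without a network connection.
    """
    try:
        # Each record is (family, type, proto, canonname, sockaddr);
        # the address is the first element of sockaddr.
        records = resolver(hostname, None)
    except socket.gaierror:
        # Raised when the system resolver cannot get an answer.
        return []
    return sorted({record[4][0] for record in records})


# Example use (needs network access):
#   print(lookup("facebook.com") or "lookup failed")
```

When resolution fails, the caller gets no address at all, so there is nothing to connect to even if the servers themselves are still running.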
Jake Williams, chief technical officer of the cybersecurity firm BreachQuest, said that while foul play cannot be completely ruled out, chances are good that the outage is “an operational issue” caused by human error.
Madory said there was no sign that anyone but Facebook was responsible and discounted the possibility that another major internet player, such as a telecom company, might have inadvertently rewritten major routing tables that affect Facebook. “No one else announced these routes,” said Madory.
Computer scientists speculated that a bug introduced by a configuration change in Facebook’s routing management system could be to blame. Columbia University computer scientist Steven Bellovin tweeted that he expected Facebook would first try an automated recovery in such a case. If that failed, it could be in for “a world of hurt” because it would need to order manual changes at outside data centers, he added.
“What it boils down to: running a LARGE, even by Internet standards, distributed system is very hard, even for the very best,” Bellovin tweeted.