Losing Facebook, WhatsApp, and Instagram for several hours on Monday was inconvenient, damaging to businesses, and in some cases, almost catastrophic. According to Facebook, it was all due to configuration changes on the backbone routers that coordinate network traffic between its data centers.
It’s a reasonable explanation, but the fact that a single error like that could bring not just Facebook but other Facebook-owned systems grinding to a halt is a bit alarming. One wrong router config change caused multiple services, and even VR headsets, to stop working entirely. On top of that, by Facebook’s own admission, it also had a cascading effect on how the company’s data centers communicate, bringing all of its services to a halt.
“The reliance on interconnected systems does carry with it an inherent risk of system or even service failure,” said Francesco Altomare, senior technical sales engineer at GlobalDots, in an email interview with Lifewire. “To counter this daunting risk, companies utilize the principle of SRE (System Reliability Engineering), as well as other tools, which all deal with varying levels of redundancy built into every layer of a system’s infrastructure.”
What Can Go Wrong
It’s worth noting that when a system like that fails, it usually requires a perfect storm of things going wrong. It’s less like a house of cards waiting to fall and more like an exposed thermal exhaust port on a space station the size of a small moon. Most companies take steps to try to ensure that the one thing that could throw everything into chaos never happens, but it can happen regardless.
“Unexpected failures are a part of business and could arise as a result of worker negligence, faults in internet service provider’s network, or even cloud storage services undergoing issues,” said Sally Stevens, co-founder of FastPeopleSearch, in an email interview. “…As long as the necessary steps to protect the system (such as backups, on-site router, and tiered access) are put in place, these failures are quite unlikely.”
Even with an army of fail-safes, though, it’s still possible for the linchpin to fail. If the system that controls things like primary forms of contact, appliances, and doors fails, the results can range from a mild inconvenience to a full-on catastrophe, depending on how much individuals and companies rely on it all.
“There is also the risk of hackers getting into the system from any of the least protected devices, such as refrigerators and oven toasters,” added Stevens, “which could lead to data theft and ransomware.”
How We Can Prepare
There’s no way to guarantee that a system will never fail, but there are steps that can be taken either to make failure less likely or to address failure more smoothly. Ideally, the two approaches would be combined, marrying fail-safes and countermeasures with contingency plans and backup systems.
“For eliminating these hazards created by third-party products and services that are effectively handled, roles and duties regarding Third-Party Risk Management must be strictly outlined,” said Daniela Sawyer, founder and chief technology officer of FindPeopleFast, in an email interview. “To flourish in these new surroundings, risk managers must grasp the essential parts of such a sophisticated ecosystem.”
What happened with Facebook, WhatsApp, and Instagram was unfortunate, but also hopefully eye-opening. People who rely on interconnected systems must understand that the right thing going wrong can disrupt everything, and measures must be put in place (or scrutinized and refined) to make such disruptions less likely and less impactful.
In Facebook’s case, the problem wasn’t the router troubles so much as having almost its entire ecosystem connected to everything else. Thus, with Facebook (the service) down, Facebook (the company) had to spend much more time and energy simply organizing and addressing the issue. If the company had either avoided such a deep-rooted, interconnected system or had backup plans in place to deal with an outage like that, the problem likely would have taken far less time to fix.