On October 4, 2021, a faulty configuration change impacted 3.5 billion people.
For approximately six hours that day, users of Facebook, Instagram, and WhatsApp were clicking the refresh button on their browsers or attempting to load their mobile apps, only to find that Facebook had, quite literally, disappeared from the Internet.
During a routine configuration change meant to assess the capacity of Facebook’s backbone network, a series of unintended events cascaded until Facebook’s environment vanished. All the systems were still “live,” but no one (not individual users, business customers, or even Facebook engineers) could connect to them.
Facebook has learned some important lessons on balancing security, availability, and recovery. The incident resembled the one Amazon faced in 2017, when a comparable human error caused a multi-hour outage of its AWS cloud services. Organizations watching major events like these unfold from the sidelines have an opportunity to reflect on their own risks, specifically the systemic risks that arise in an age of heavy reliance on third-party providers.
The Internet is an interconnected system that allows devices to communicate with each other. As with most complex systems, it started off modestly, with few systems and even fewer interdependencies between them. As the Internet matured and entire new industries appeared around it, it reached the point where any business with at least one computer relies on some tech company, and that reliance introduces some level of systemic risk.
Systemic risk (or more specifically, systemic cyber risk) is the possibility of an adverse event that impacts one critical system and cascades down to affect many other systems. Said differently, it is the risk introduced by the third-party technology products you use or outside providers you work with. Given that cloud computing and software-as-a-service products are now the norm, the existence of systemic risk demands that every business reckon with a key question: how do you assess your technological dependencies so that you can evaluate your organization’s level of risk?
There is a saying in security that always stuck with me:
“Defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.”
Most organizations will assess systemic risk by putting together a simple list of third-party vendors they use. Good start. Some may go a step further and document how each vendor is used and the type of data that is shared with them. That’s a bit better. But to truly understand systemic cyber risk, you have to go further still. You have to think in graphs.
Thinking in graphs means adding multiple dimensions to the mental model of your system. Start with your same list of vendors and service providers — but from there, take it from 2D to 3D. For each area of your business, do the following:
List the software and services that your organization depends on to function. This should include key vendors, SaaS applications, and components of core systems (e.g. the software components and dependencies of your infrastructure). For example, if you provide your clients a website to purchase your product, you may include your website hosting provider, your payment card processor, etc.
Rate the criticality of, or your reliance on, each service: could you survive for a period of time if it went down? The rating system is your choice; I prefer to use a color-coded chart based on criticality.
Draw the connections. I find that starting with the list created in the first step and drawing lines to show how everything is connected provides a better sense of the interconnectedness of systems. It will be messy at first, and that’s okay. As more things connect to an item in your list (i.e. more lines go to it), raise its criticality, because more of your systems depend on it.
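The line-drawing step can also be sketched in code. The snippet below is a minimal illustration, not a prescribed method: the system names, the `DEPENDS_ON` inventory, and the color thresholds are all hypothetical and would need to be replaced with your own. It counts how many systems point at each dependency and turns that count into a color rating.

```python
from collections import defaultdict

# Hypothetical inventory: each system maps to the services it relies on.
# Every name here is illustrative, not taken from a real environment.
DEPENDS_ON = {
    "storefront": ["web_hosting", "payment_processor", "dns_provider"],
    "order_fulfillment": ["erp", "payment_processor"],
    "customer_support": ["helpdesk_saas", "dns_provider"],
    "erp": ["cloud_database"],
}


def count_dependents(depends_on):
    """Count how many systems point directly at each dependency."""
    counts = defaultdict(int)
    for deps in depends_on.values():
        for dep in set(deps):  # ignore accidental duplicates in a list
            counts[dep] += 1
    return dict(counts)


def rate(count, red_at=3, yellow_at=2):
    """Map a dependent count to a color. The thresholds are arbitrary assumptions."""
    if count >= red_at:
        return "red"
    if count >= yellow_at:
        return "yellow"
    return "green"


if __name__ == "__main__":
    counts = count_dependents(DEPENDS_ON)
    # Print the most depended-on items first.
    for name, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        print(f"{name}: {count} dependent(s), rated {rate(count)}")
```

A count of incoming lines is only a starting signal. A service with a single dependent can still deserve a red rating if that one dependent is your revenue path, so treat the computed color as a prompt for review rather than a final answer.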
With a better understanding of your critical systems and how much of your business depends on them, you can begin creating contingency plans to address failures in services that you don’t directly own or manage.
Below is a simple example of what a graph could look like for a basic eCommerce website. Each component is color-coded with the criticality based on the significance of an interruption.
For your exercise, you’ll want to get granular and dig into how each system relies on the others and what cascading effects a failure could have. For instance, pose the question: could you still process orders if...
...your ERP system went down? Possibly, depending on your system, but you may not have records of transactions, so it could be too risky.
...your customer support platform went down? Yes, but it might be an inconvenience for your users.
...you lost connectivity with system monitoring services? You may panic, thinking your systems are down, but orders would still roll in.
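These “could you still...” questions amount to asking what sits downstream of a failed component in the graph. Here is a rough sketch, assuming a hypothetical dependency map for an eCommerce business. The names and edges are illustrative only, and the choice to hang `order_records` (rather than `process_orders`) off the ERP is an assumption that mirrors the “possibly, depending on your system” answer above.

```python
from collections import deque

# Hypothetical map: each business function or system lists what it relies on.
DEPENDS_ON = {
    "process_orders": ["storefront", "payment_processor"],
    "storefront": ["web_hosting"],
    "order_records": ["erp"],
    "answer_tickets": ["customer_support_platform"],
    "alerting": ["system_monitoring"],
}


def impacted_by(failed, depends_on):
    """Return everything that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {}
    for item, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(item)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


def can_still(function, failed, depends_on=DEPENDS_ON):
    """True if `function` is not downstream of the failed component."""
    return function != failed and function not in impacted_by(failed, depends_on)


if __name__ == "__main__":
    for outage in ("erp", "customer_support_platform", "system_monitoring", "web_hosting"):
        answer = "yes" if can_still("process_orders", outage) else "no"
        print(f"Process orders if {outage} goes down? {answer}")
```

A plain reachability check like this gives only yes-or-no answers. The softer cases, such as orders that go through but leave no transaction record, still need human judgment on top of what the graph reports.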
Facebook’s recent live-fire disaster recovery scenario has provided an opportunity for companies to see the impact one vendor or product going down can have on a large scale.
This graphing exercise is useful for plotting out the different components of your system and seeing how they depend on one another. If nothing else, it is a thought experiment through which you can switch your mindset from a list of dependencies to a graph of dependencies. With that new mindset, you are in a better position to create a contingency plan and react quickly when one or more of those subcomponents goes down.
Tech companies are not too big to fail. Even with redundant systems and backups, the companies that you rely on every day to support your systems are susceptible to outages that can and will impact your business. You must prepare accordingly.