Manage risk by managing for resilience

Glenn Wilson
7 min read · Apr 20, 2023


Risk assessments — a staple activity within the current paradigm of managing risk. The idea is simple: identify how much risk we carry by measuring two parameters — impact and likelihood. Risk is the product of these two metrics, calculated using a risk matrix with a very simple algorithm. In its simplest manifestation, low impact and low likelihood result in low risk; high impact and high likelihood raise the overall risk to high. Between these two extremes sits medium risk, the result of low impact and high likelihood (or vice versa). For visual impact, the matrix substitutes colours for the words: green for “low”, amber for “medium” and red for “high”.

Simple risk matrix (Shutterstock)

Of course, this simple risk matrix may not cover all the options needed to give a more accurate representation of risk. Therefore, there is a tendency to give numerical ratings to the impact and likelihood values. For example, low = 1 and high = 4, which means there are 16 possible scores instead of the previous 4, ranging from 1 (low impact and low likelihood) to 16 (high impact and high likelihood). Often the highest and lowest values are given new labels, such as “very low” for low/low and “critical” for high/high. This extra granularity is supposed to produce a risk score more representative of the calculated risk. But is this enough? Are some critical risks more critical than others? Should the matrix scale be higher? If so, how high? Is 5 enough? Or 6? Or 10? Growing the scale may provide the requisite variety, but it also makes the matrix more complicated to use.
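The scoring scheme above can be sketched in a few lines. The 1 to 4 scale and the “very low”/“critical” labels come from the text; the exact band thresholds between them are an assumption for illustration, since the article deliberately leaves them open to question.

```python
def risk_score(impact: int, likelihood: int) -> int:
    """Risk as the product of impact and likelihood, each rated 1-4."""
    if not (1 <= impact <= 4 and 1 <= likelihood <= 4):
        raise ValueError("impact and likelihood must be rated 1-4")
    return impact * likelihood


def risk_label(score: int) -> str:
    """Map a 1-16 score onto colour bands; thresholds are illustrative."""
    if score == 1:
        return "very low"   # low impact and low likelihood
    if score <= 4:
        return "low"        # green
    if score <= 9:
        return "medium"     # amber
    if score < 16:
        return "high"       # red
    return "critical"       # high impact and high likelihood


print(risk_label(risk_score(1, 1)))  # very low
print(risk_label(risk_score(2, 3)))  # medium
print(risk_label(risk_score(4, 4)))  # critical
```

Note how arbitrary the band boundaries are: whether a score of 9 is “medium” or “high” is a judgment call, which is exactly the subjectivity the next section questions.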

The Challenge

Do these matrices actually give a true representation of risk? How would we know? In cybersecurity, it is usual practice to prioritise the higher risks over the lower ones. But does this translate into improved security? Again, how would we know? For me, there are several weaknesses in managing cybersecurity risks this way. First, the scoring of likelihood and impact is often subjective. There is rarely any empirical data to help measure these values, so they rest on assumptions. Second, there are too many unknowns to calculate the likelihood of an exploit occurring and the impact it could have. Third, what is the risk affecting? Is it financial loss? Is it reputation? Is it customer confidence? Is it personal risk or organisational risk? Risk is potentially damaging, but it seems to me almost impossible to quantify, as I will try to articulate below.

Likelihood

Likelihood is normally assessed on criteria associated with the ratio between reward and effort for an attacker. The greater the reward, the more likely a weakness will be exploited; conversely, the greater the cost of the effort for the attacker, the less likely the weakness will be exploited. For example, a vulnerability that has been documented but remains unpatched is more likely to be exploited than a zero-day vulnerability. Financial institutions, where the monetary gains are high, are more likely to be targeted than non-financial institutions. The prevalence of threats is also a factor in determining likelihood. For example, the rise in crypto-mining has increased the likelihood of resources being hijacked to mine cryptocurrency, and the growing number of ransomware attacks makes that type of attack more likely. However, the real challenge is how to quantify likelihood. At any moment in time, the likelihood that a weakness will be exploited is difficult to predict. What assumptions are being made about the cost-and-reward ratio? If the likelihood is low, can it be ignored? If it is high, what does that mean? Is likelihood measured in temporal terms (a number of days, weeks, months?) or based on the here and now? By what means can it be measured?

Impact

Measuring impact is difficult because it requires an understanding of the interrelationships between entities within the organisation or system. Digital products and services are becoming more complex. Microservices, APIs, third-party components, mergers and acquisitions, and new technologies have made it difficult, if not impossible, to have a full picture of the architecture of the system. One component can have links to other components, either directly or, more commonly, indirectly. Chaos theory suggests that a butterfly flapping its wings in Brazil can cause a tornado in Texas. By the same token, a vulnerability exploited in one minor part of the system could have a catastrophic and unpredictable effect on other parts of the system. How feasible is it to determine the impact on a system that has built-in complexity and unknowable outcomes from a given state?

Risk

A key challenge when assessing risk is the dynamic nature of its impact and likelihood — these values can change over time: the longer a weakness exists, is it more likely to be exploited? Or, as threats shift, does the likelihood change? Given the complex and uncertain nature of systems, how is impact determined? The blast radius may unknowingly increase or decrease as the system changes. Risk, therefore, is a dynamic value, changing constantly. A low risk today may be a high risk tomorrow; a small blast radius one day may be system-wide the next. We can only assume what the future looks like, based on our understanding of the changing cybersecurity landscape; we cannot predict or quantify it.

Another challenge in managing risk is that preventative measures to protect the organisation from incidents assume that robustness is the only requirement for managing cybersecurity risk. Robustness implies an impenetrable stronghold. It creates an air of invincibility that is dangerous for the organisation. Complacency sets in, based on the premise that all risks have been identified, quantified and managed accordingly. Product delivery teams continue their work under a protective blanket. But I believe this is false hope. Incidents will happen. And without the skills to manage the fallout from these incidents, organisations are found wanting, sometimes with catastrophic consequences. Mitigations such as disaster recovery plans, documented, signed off and archived, have no substance. Blame games ensue as the stakeholders with the most to lose seek out scapegoats who failed to recognise the risk and protect their valuable assets. Managing risk then becomes a matter of limiting damage: unquantifiable damage affecting customers, employees, shareholders and suppliers.

Resilience

If we are unable to accurately quantify risk, how do we manage it? Deming is often misquoted as saying that you cannot manage what you cannot measure; what he actually wrote was that “it is wrong to suppose that if you can’t measure it, you can’t manage it — a costly myth”. Indeed, I believe that managing risk means being resilient to risk. We should start with the assumption that something bad will happen at some point. The key to this strategy is to foster a culture that can adapt to unplanned and unpredictable occurrences, can learn from failures, and can flex to meet uncertain conditions. Changing behaviour to actively seek potential problems, to deliberately create failing conditions, and to proactively stress the system builds the resilience needed to maintain the viability of the organisation in the most challenging of circumstances. Donald Schön wrote that bringing past experience to a unique case brings familiarity to an unfamiliar situation. Therefore, it is essential to amplify requisite variety for managing uncertain conditions by building experiential learning within the organisation. This creates the conditions for building the resilience needed to manage risk as it unfolds, rather than managing risk through mitigating actions prioritised on the basis of assumptions and predictions. When an incident occurs, the resilient organisation is better prepared to work through it, even if it has never seen it before, than the organisation that has tried to manage risk through preemptive mitigation (or through risk avoidance or risk reduction).


Managing risk

I am not suggesting that we stop using risk assessments to manage risks. Quite the opposite: we need to combine assessing risks with techniques for building resilience. I posit that we should turn the assumptions derived from risk assessments into theories that provide the basis for experimentation. If we know what the steady state is, based on normal behaviour, does a change to the system create risks? If we apply variables to the system based on the output of a risk assessment, does the steady state persist? For example, imagine we find a credential that we have assessed to be overly permissive. What happens if we change the credential to make it less permissive? Is the steady state maintained? Or does a component unexpectedly fail? If so, how critical is the failure to the system? Can the system recover from the unexpected failure by applying countermeasures? If so, a potential risk to the system has been mitigated by reducing the permissiveness of the credential. If not, the risk to the system can be measured based on the impact (the systems affected). The likelihood value is, in my opinion, rendered moot: instead we should assume it is plausible that the unexpected state could occur, and measure risk by how resilient the system is in recovering from that plausible state.
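The experiment described above follows a simple loop: verify steady state, apply the change derived from the risk assessment, observe, and roll back if steady state is lost. The sketch below shows that loop. The function names, the error-rate probe and the threshold are hypothetical placeholders, not references to any real tooling; it assumes only that the organisation can observe some steady-state metric.

```python
def steady_state_ok(error_rate: float, threshold: float = 0.01) -> bool:
    """Steady state holds while the observed error rate stays below a threshold."""
    return error_rate < threshold


def run_experiment(apply_change, rollback, observe_error_rate) -> str:
    """Apply a risk-derived change, observe the system, roll back on failure."""
    if not steady_state_ok(observe_error_rate()):
        return "aborted: system not in steady state to begin with"
    apply_change()                      # e.g. tighten the permissive credential
    if steady_state_ok(observe_error_rate()):
        return "mitigated: steady state persisted under the change"
    rollback()                          # contain the blast radius
    return "finding: change broke steady state; measure impact, improve recovery"


# Illustrative run against a stubbed system in which the change causes failures.
state = {"error_rate": 0.0}
result = run_experiment(
    apply_change=lambda: state.update(error_rate=0.05),
    rollback=lambda: state.update(error_rate=0.0),
    observe_error_rate=lambda: state["error_rate"],
)
print(result)
```

Either outcome is useful: a “mitigated” result retires the assumed risk, while a “finding” converts a guessed likelihood into an observed impact that can be worked on.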

Running this type of experiment, often based on the risks that have been documented, helps organisations manage their risks and, at the same time, build in the resilience required to recover from many different scenarios, whether or not they have been encountered before. Robust systems are never infallible, but resilience reduces the overall risk to the organisation by reducing the impact of failure in all plausible situations.
