Why should you do (security) chaos engineering?
It seems to me that chaos engineering is gaining ground: I see more and more references to it in social media and in business settings. However, a few prevailing myths around chaos engineering may hinder an organisation's appetite for exploring it. With the imminent release of Kelly Shortridge's book on security chaos engineering (Shortridge, 2023, O'Reilly), there is potential for a new way of improving the security of our organisations and of the products and services they deliver. In this post, I want to explore the reasons for using (security) chaos engineering ((S)CE) within an organisation and, in the process, bust some of the myths I have come across.
Perhaps the main concern within the leadership teams of organisations is the word 'chaos' itself. The Oxford English Dictionary defines chaos as 'complete disorder and confusion'. It conjures up images of people randomly disrupting services without any consideration for the consequences of their actions. As a result, organisations are reluctant to engage in a practice that seems anathema to the orderliness and predictability that senior executives desire. However, the word 'chaos' obscures the essence of what chaos engineering means: it relies on a disciplined approach involving planning, execution and analysis. Furthermore, by introducing chaos engineering into the organisation, individuals and teams start to build ways to manage the volatile and uncertain environments in which organisations operate.
Another myth I hear is that (S)CE is dangerous because an experiment could disrupt critical systems, causing financial loss and reputational damage. I posit that introducing (S)CE into an organisation reduces the risk of disruption over time. Returning to the point raised in the previous paragraph: by planning an experiment carefully, it should be possible to predict its impact with some level of certainty. This does, however, depend on a reasonable understanding of how the target system works. Care should be taken when applying (S)CE to a system that is volatile and unpredictable; if the organisation faces this level of uncertainty, it is advisable to stabilise the system before running experiments. Not only will this enable (S)CE, it will improve the reliability and robustness of the target system. If every experiment is likely to cause major problems, then arguably either the experiment has too large a blast radius or the organisation is too fragile. Of course, if it is the latter, the organisation is already operating with high stakes in a high-risk setting.
Being reliable and robust is not the same as being resilient, and this conflation is another myth worth exploring. The reason for doing (S)CE is not to build robustness and reliability into the system; other frameworks and methodologies, such as quality control, are enablers of those. Robustness and reliability suggest that failure is not an option. But failure is not a choice either: no system is guaranteed never to fail. Resilience, on the other hand, provides opportunities to manage unexpected outcomes (that is, failures) with greater efficiency and efficacy. What I mean by this is that resilience allows people within an organisation to bring past experiences to bear on a unique situation (Schön, 1991, p. 138). Thus, running experiments and seeing unfamiliar outcomes enriches our experience, enabling us to see unfamiliar situations as familiar ones and letting past experience inform how we deal with unique situations (Schön, 1991, p. 140). With this in mind, it is important to reflect on whether an organisation is resilient to the volatile world in which it operates; (S)CE is a means to generate and maintain that resilience. I have encountered the opinion that desktop disaster recovery exercises and a full disaster recovery policy build resilience. But such exercises are infrequent, and the limited scope of the parameters involved (the lack of requisite variety) provides few opportunities for an organisation's continual learning or for building the operational capability to manage uncertainty. Embedding (S)CE into daily operational activities does.
Starting out on the (S)CE journey can be daunting. My suggestion is to experiment with experiments! In other words, a good approach is to build a theory that doing (S)CE will allow the organisation to identify unknowns within the system, and then start with a small experiment that lets team members understand how (S)CE works. The experiment could be as simple as removing write access from a local account that should only read data, and then performing a task that uses the account under normal circumstances. If the task breaks, that suggests an account that should have read-only access is writing data somewhere. The key to running such a simple experiment is to let team members reflect on how it went. Was the theory clearly described? Was the experiment effective in proving or disproving the theory? Does the experiment help the team to understand (S)CE? I have often heard engineers suggest taking out a whole cloud region to see what happens. In reality, that is not an experiment: there is no theory, and "let's see what happens" is not a basis for experimentation. The theory needs greater granularity.
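The read-only experiment above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration rather than a production harness: the `reader_task` function and the file names are invented for the example, and revoking write permission on a temporary directory stands in for removing write access from an account. The shape is what matters: state a theory, inject the fault, observe, and restore.

```python
import os
import stat
import tempfile

def reader_task(path):
    """A task that is *supposed* to only read its data (hypothetical example).

    It secretly writes a log file, which is exactly the kind of
    hidden dependency an (S)CE experiment aims to surface."""
    with open(path) as f:
        data = f.read()
    with open(path + ".log", "w") as f:  # hidden write: violates the theory
        f.write("read %d bytes" % len(data))
    return data

def run_experiment():
    # 1. State the theory up front, in plain language.
    theory = "reader_task needs no write access to its data directory"
    with tempfile.TemporaryDirectory() as d:
        data_file = os.path.join(d, "data.txt")
        with open(data_file, "w") as f:
            f.write("hello")
        # 2. Inject the fault: make the directory read-only, simulating
        #    the removal of the account's write permission.
        os.chmod(d, stat.S_IRUSR | stat.S_IXUSR)
        try:
            # 3. Perform the task under otherwise normal conditions.
            reader_task(data_file)
            outcome = "theory held: task succeeded without write access"
        except OSError as exc:
            outcome = "theory falsified: task attempted a write (%s)" % type(exc).__name__
        finally:
            # 4. Restore the system so cleanup can proceed.
            os.chmod(d, stat.S_IRWXU)
    return theory, outcome

if __name__ == "__main__":
    theory, outcome = run_experiment()
    print(theory)
    print(outcome)
```

Because `reader_task` sneaks in a write, the experiment falsifies the theory on an ordinary user account, and the team learns something concrete about the system. The same four-step shape (theory, fault injection, observation, restoration) scales up to more ambitious experiments.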
I titled this post with a question: why should you do (security) chaos engineering? In conclusion, I suggest that by doing (S)CE, organisations build capabilities that provide opportunities to learn about the system, reduce uncertainty and increase resilience to unforeseen and unfamiliar situations. Starting with a simple experiment provides an opportunity to learn how to do (S)CE and sets the organisation on its journey into chaos engineering.
References
Schön, D. (1991) The Reflective Practitioner: How Professionals Think in Action, Routledge
Shortridge, K. (2023) Security Chaos Engineering: Developing Resilience and Safety at Speed and Scale, O’Reilly