Why security chaos engineering works, and how to do it right
For many people, security isn’t the first thing that comes to mind when they hear about chaos engineering. Even fewer would consider it a basic security practice on par with network firewall configuration, identity management, and intrusion detection.
However, the growing complexity of modern software security layers, combined with increasingly modular and distributed architectures, has reached a point where the risk of failure justifies using chaos engineering as a security tool. It is therefore likely that chaos engineering will become part of routine, and even essential, security management processes.
Let’s explore why chaos engineering is gaining traction as a security management method, how it’s applied in security scenarios, and which best practices to follow when putting it into practice, including some common pitfalls to avoid.
The role of chaos engineering in security
Chaos engineering is a broad term for testing complex systems by injecting faults before an application encounters them in normal operation, monitoring the results, and documenting the correct course of action. The concept is often applied to operational hardware, including networks and server pools, as well as to software development and product testing.
Chaos might not sound like something a security specialist or compliance team would want to cultivate in their software systems. However, the goal of chaos engineering is to avoid chaos by identifying unseen problems and potential failures before they occur in production. And as the practice matures, chaos engineering is gaining more attention in the area of application security.
By performing chaos engineering directly on the security layers, security professionals can expand the number of situations and attack vectors they simulate. They can also test how the relationships between layers and features affect the impact of a particular failure. Ultimately, this reveals areas where the layers of security fail to provide an effective barrier against attacks and intruders.
Applying Chaos to Security: Injection and Monitoring
Chaos engineering security testing is a matter of balancing two layers. One layer handles fault injection; the other handles monitoring and resolution. For example, one layer feeds in test data to simulate unauthorized access attempts. The other layer identifies problems by looking for signals of a security breach, allowing security teams to pinpoint gaps in access controls.
If these injections cause a failure or uncover a hole in existing security barriers, the monitoring process should identify the exact time and point at which the problem or breach occurred. Logs and monitoring data from the application infrastructure, together with the log of injected security faults, also help to correlate infrastructure-related issues that may pose a security threat.
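The two layers can be illustrated with a minimal sketch. Everything named here is hypothetical: `POLICY` and `is_allowed` stand in for whatever access control is actually under test, and the list of attempts is invented test data. The injection side replays requests that should all be denied, and the monitoring side records, with a UTC timestamp, every request that was wrongly allowed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical access policy under test: role -> resources it may read.
POLICY = {"admin": {"billing", "audit-log"}, "analyst": {"billing"}}


def is_allowed(role: str, resource: str) -> bool:
    """Stand-in for the real access control check (an assumption)."""
    return resource in POLICY.get(role, set())


@dataclass
class Finding:
    timestamp: datetime  # when the gap was observed (UTC)
    role: str
    resource: str


def inject_and_monitor(attempts, check):
    """Injection layer: replay attempts that should all be denied.
    Monitoring layer: record every attempt the check wrongly allows."""
    findings = []
    for role, resource in attempts:
        if check(role, resource):
            findings.append(Finding(datetime.now(timezone.utc), role, resource))
    return findings


# Simulated unauthorized access attempts; none of these should succeed.
unauthorized = [("analyst", "audit-log"), ("guest", "billing"), ("", "audit-log")]

findings = inject_and_monitor(unauthorized, is_allowed)
print(f"{len(findings)} access control gap(s) found")
```

In a real test the check would be a call into the running system rather than a local function, but the shape stays the same: a known-bad input goes in, and any success is logged as a finding with the time it occurred.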
While it’s possible to apply chaos engineering to security and infrastructure separately, that would probably be a mistake. Security breaches can arise not only from unexpected events directly related to security or threat prevention tools, but also from events in the IT infrastructure. For example, infrastructure failures often leave systems running in a “failure mode” that may not damage functional elements but does bypass certain security elements to allow for remediation, creating a potentially unwanted workaround.
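A small sketch shows how an infrastructure fault can turn into a security bypass. The identity provider, the user names, and both `authorize_*` functions are hypothetical. One variant "fails open" during an outage so that work can continue, the other "fails closed", and the experiment simply injects the outage and checks whether an unknown user gets through.

```python
class IdentityProviderDown(Exception):
    """Simulated infrastructure fault (hypothetical)."""


def lookup_role(user: str, provider_up: bool) -> str:
    """Stand-in for a call to an external identity provider."""
    if not provider_up:
        raise IdentityProviderDown("identity provider unreachable")
    return {"alice": "admin"}.get(user, "none")


def authorize_fail_open(user: str, provider_up: bool = True) -> bool:
    """Risky pattern: during an outage, let everyone in."""
    try:
        return lookup_role(user, provider_up) == "admin"
    except IdentityProviderDown:
        return True


def authorize_fail_closed(user: str, provider_up: bool = True) -> bool:
    """Safer pattern: during an outage, deny access."""
    try:
        return lookup_role(user, provider_up) == "admin"
    except IdentityProviderDown:
        return False


def bypassed_during_outage(authorize) -> bool:
    """Inject the outage and report whether an unknown user gets in."""
    return authorize("mallory", provider_up=False)


print(bypassed_during_outage(authorize_fail_open))    # True
print(bypassed_during_outage(authorize_fail_closed))  # False
```

Neither a pure security test nor a pure infrastructure test would catch the first variant: the access check is correct while the provider is up, and the outage on its own looks like an availability problem.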
4 Tips for a Chaos Engineering Security Plan
When running chaos-style testing, it’s important not to fall into the trap of focusing only on common, predictable problems. Instead, try to focus on issues that are unlikely but still possible.
In fact, chaos engineering naturally requires testing for failures caused by both human error and system faults. Since the goal is to create “chaos”, limiting tests to predictable behavior contradicts that goal. Therefore, test injections that introduce a high level of randomness are usually the most effective.
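One way to get randomness without losing the ability to replay a run is to draw the injections from a seeded generator. The sketch below assumes a hypothetical fault catalogue and target list (`FAULTS` and `TARGETS` are invented names) that mix human error with system failure. It only plans the experiments and does not perform any injection.

```python
import random

# Hypothetical fault catalogue mixing human error and system failure.
FAULTS = [
    "expired_tls_certificate",
    "firewall_rule_deleted",
    "overly_permissive_iam_role",
    "auth_service_latency",
    "log_pipeline_outage",
    "dns_resolution_failure",
]
# Hypothetical systems the faults could be injected into.
TARGETS = ["api-gateway", "payments", "user-db", "build-server"]


def plan_experiments(seed: int, count: int):
    """Draw random (fault, target) pairs; the same seed yields the same plan."""
    rng = random.Random(seed)
    return [(rng.choice(FAULTS), rng.choice(TARGETS)) for _ in range(count)]


seed = 20240501  # store this alongside the results so the plan can be replayed
for fault, target in plan_experiments(seed, 5):
    print(f"inject {fault} into {target}")
```

Recording the seed with the results means an unusual combination that exposes a gap can be planned again exactly, even though it was chosen at random.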
Monitoring is the other key element of chaos engineering, especially when it comes to security validation. Given the sheer volume of test data and the possible combinations and interactions of events, most failures are unlikely to be reproducible. This underscores the critical importance of data: if the testing routine doesn’t gather all the information needed to identify and fix a problem, the entire process wastes a lot of time and money.
Logs and telemetry from both infrastructure and applications are a big part of meeting this requirement, as is accurate information about the events injected. Accurate and synchronized timestamps are particularly critical, because without them the relationships between specific causes and effects cannot be reliably documented. It’s the connection between chaotic events and poor outcomes that makes chaos engineering worthwhile, and that connection is easily lost when time records are inaccurate.
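The role of timestamps can be sketched as a simple time-window join between the injection log and the alerts seen by monitoring. All records below are invented sample data, and the 60-second window is an arbitrary assumption. Both logs are assumed to use timezone-aware UTC times taken from synchronized clocks.

```python
from datetime import datetime, timedelta, timezone


def utc(hour: int, minute: int, second: int) -> datetime:
    """Helper for building sample timestamps on one (arbitrary) day."""
    return datetime(2024, 5, 1, hour, minute, second, tzinfo=timezone.utc)


# Invented injection log and alert log, for illustration only.
injections = [
    {"id": "inj-1", "fault": "firewall_rule_deleted", "at": utc(10, 0, 0)},
    {"id": "inj-2", "fault": "auth_service_latency", "at": utc(10, 5, 0)},
]
alerts = [
    {"msg": "unexpected inbound connection", "at": utc(10, 0, 12)},
    {"msg": "login succeeded without MFA", "at": utc(10, 5, 40)},
    {"msg": "disk usage high", "at": utc(10, 30, 0)},
]


def correlate(injections, alerts, window=timedelta(seconds=60)):
    """Attach each alert to every injection that preceded it within `window`."""
    matches = {inj["id"]: [] for inj in injections}
    for alert in alerts:
        for inj in injections:
            delta = alert["at"] - inj["at"]
            if timedelta(0) <= delta <= window:
                matches[inj["id"]].append(alert["msg"])
    return matches


result = correlate(injections, alerts)
for inj_id, msgs in result.items():
    print(inj_id, msgs)
```

If the clocks behind the two logs drift apart by more than the window, the alerts fall outside it and the cause-and-effect link disappears, which is the failure the paragraph above warns about.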
The last key element of chaos engineering revolves around the people responsible for it. Security personnel cannot effectively conduct chaos engineering tests in isolation, because on their own they cannot accurately reproduce the underlying system failures that trigger unexpected failure modes and allow certain security measures to be bypassed.
To be effective, chaos engineering requires collaboration between operations personnel and security teams. It is important to establish this collaborative model at the outset of such a program, and equally important to carry it through test design, execution, and evaluation.