How ‘Chaos Engineering’ and ‘Security Differently’ Improve DevOps [Interview with Aaron Rinehart]

Aaron Rinehart is co-founder and CTO at chaos engineering startup Verica along with Casey Rosenthal from Netflix. Aaron began pioneering the application of security in chaos engineering during his tenure as the chief security architect at the largest private healthcare company in the world, UnitedHealth Group (UHG). While at UHG, Aaron released ChaoSlingr, one of the first open source software releases focused on using chaos engineering in cybersecurity to build more resilient systems. In a recent interview with Aaron, we discussed why it may be advantageous to apply the principles of Chaos Engineering and Security Differently to DevOps practices.

What was the trigger that made you develop ChaoSlingr and when did you know it needed to be open source?

Aaron: ChaoSlingr was inspired by a number of different things, I suppose. I first learned about the concepts of Chaos Engineering when UnitedHealth Group hired its first Site Reliability Engineer (SRE), Patrick Bergstrom from BestBuy.com. The first week that Patrick started at the company, several folks mentioned that Patrick and I had to meet. Following the guidance of my trusted friends and colleagues, we put some time on the calendar and got to know each other. Patrick described to me what an SRE is, the mission he was hired for at UHG and what kinds of things SREs do.

Having previously worked in Reliability and Safety Engineering at NASA, I was initially skeptical but curious about a Reliability Engineer being hired at the company. It wasn't until after Patrick explained to me what SREs do that I saw the clear difference. When Patrick was describing the role of SREs, he started talking about this concept of Chaos Engineering. He described it in terms of something that his team did at BestBuy.com to improve the resilience and quality of BestBuy.com’s technical infrastructure.

Later that night I woke up thinking about the concepts of Chaos Engineering and writing experiments in the form of hypotheses. There was something about it that made sense to me for security as well, but I couldn't put my finger on it initially. It wasn't until I started thinking about some of the challenges I was facing as the Chief Security Architect that it became clear. The challenge that struck me initially was the uncertain and incomplete technical architecture artifacts that engineers and architects would bring to me when asking for Security Architecture recommendations. I genuinely wanted to provide these folks with good security control recommendations, but I was never sure how accurate the inputs were, which ultimately affected the outputs I gave back to them. Furthermore, I was never sure if the guidance I provided was actually implemented, or implemented correctly. I needed a way to ask the computer the questions to get the answers and validation I needed. Security Control validation was the original thinking behind applying Chaos Engineering to Security with ChaoSlingr.
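To make the idea of "writing experiments in the form of hypotheses" concrete, here is a minimal sketch of what a security control validation experiment might look like. It is not ChaoSlingr's actual code: the Experiment class, the toy system dictionary, the port number, and every function name below are hypothetical and chosen only for illustration. The sketch injects a condition (an unauthorized open port), checks whether the expected control responded (an alert was raised), and always rolls the change back.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """A security chaos experiment phrased as a hypothesis (hypothetical type)."""
    hypothesis: str
    inject: Callable[[dict], None]    # introduces the condition under test
    verify: Callable[[dict], bool]    # True if the control behaved as hypothesized
    rollback: Callable[[dict], None]  # restores the original state


def run(experiment: Experiment, system: dict) -> bool:
    """Inject the condition, check the hypothesis, and always roll back."""
    experiment.inject(system)
    try:
        return experiment.verify(system)
    finally:
        experiment.rollback(system)


def make_system(detector_enabled: bool) -> dict:
    """Toy stand-in for a real environment; all fields are assumptions."""
    return {"open_ports": {443}, "alerts": [], "detector_enabled": detector_enabled}


def open_unauthorized_port(system: dict) -> None:
    """Simulate a misconfiguration by opening a port that should stay closed."""
    system["open_ports"].add(2222)
    # Toy detection control: it only raises an alert if it is actually enabled.
    if system["detector_enabled"]:
        system["alerts"].append("unauthorized port 2222 opened")


def alert_was_raised(system: dict) -> bool:
    """Check whether the detection control noticed the injected change."""
    return any("2222" in alert for alert in system["alerts"])


def close_port(system: dict) -> None:
    """Undo the injected change."""
    system["open_ports"].discard(2222)


port_experiment = Experiment(
    hypothesis="An unauthorized open port is detected and alerted on",
    inject=open_unauthorized_port,
    verify=alert_was_raised,
    rollback=close_port,
)


def main() -> None:
    # Run the same experiment against a system where the control works
    # and one where it was never actually switched on.
    for enabled in (True, False):
        held = run(port_experiment, make_system(enabled))
        print(f"detector_enabled={enabled}: hypothesis held = {held}")


if __name__ == "__main__":
    main()
```

The second run is the interesting one: the control was recommended but never enabled, and the experiment surfaces that gap directly instead of relying on architecture documents being accurate.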

So, the trigger to kickstart the ChaoSlingr project came partially from a novel idea and partially from a potential solution to a core problem. I also wanted to continue to challenge the company’s leadership by showing them what was possible. After being part of the DevOps transformation I wanted to identify another way we could pull the company forward.

With regard to the Open Source question, we initially weren’t even thinking about that as an option, mostly because we weren’t even sure what we were doing at first. This was an unfunded group of passionate engineers, architects, and basically volunteers wanting to work on interesting ideas. Once we got about three weeks into the project, someone threw out the idea that we make this our first open source project. Not only was the timing ideal, but I was also in the right position to make it happen. Prior to working on the project I had been involved in working with our Open Source Attorney, Kevin Nelson, in the creation and approval of our Open Source Software Consumption and Contribution Policies. I additionally worked with Kevin to iron out and implement an official process for approvals and artifact collection to properly meet the company’s needs in open sourcing a software project.

It made it seem like a possibility, given I knew the process, who was needed and so on. We brought a few more folks to the growing initiative to help us with the paperwork and, after about eight weeks, we not only released the company’s first Open Source Software project but also introduced a new method of instrumenting security to the industry. There is a reason why there are so many names listed in the GitHub repo for ChaoSlingr. Not everyone on that list worked on the code for ChaoSlingr, but—given the project was unofficial and unfunded—people put extra time in their day to work on it and see it succeed. For some folks it was helping with the legal paperwork, for others it was writing and testing code.

What is Security Differently?

Aaron: Security Differently is my attempt to translate over seventy years of research, practices, and lessons learned from the field of Safety Engineering to cybersecurity. Specifically ‘Security Differently’ comes from Sidney Dekker’s ‘Safety Differently’.

Safety Differently is the name given to a movement within the safety industry that challenges organizations to view three key areas of their business differently—how safety is defined, the role of people, and the focus of the business.

The problems that plague traditional safety practices are underscored by three main principles:

Workers are considered to be the cause of poor safety performance. Workers make mistakes, they violate rules, and they ultimately make safety numbers look bad. That is, workers represent a problem that an organization needs to solve.
Because of this, organizations intervene to try and influence workers’ behavior. Managers develop strict guidelines and tell workers what to do, because they cannot be trusted to operate safely alone.
Organizations measure their safety success through the absence of negative events.

Safety Differently challenges the traditional paradigm by flipping traditional thinking on its head, and encourages organizations to grow safety from the bottom, up—rather than impose it from the top, down:

People are not the problem to control, they are the solution. Learn how your workers create success on a daily basis and harness their skills and competencies to build a safer workplace.
Rather than intervening in worker behavior, intervene in the conditions of their work. This involves collaborating with front-line staff and providing them with the right tools and environment to get the job done safely. The key here is intervening in workplace conditions rather than worker behavior.
Measure safety as the presence of positive capacities. If you want to stop things from going wrong, enhance the capacities that make things go right.

So, what does all this have to do with cybersecurity? As a field, Safety Engineering has a striking number of similarities to Security Engineering. I began discovering Sidney Dekker’s research about the same time as we were building and running ChaoSlingr. His books ‘The Field Guide to Understanding Human Error’ and ‘Drift into Failure’ are seminal works in the field of airline accident investigations. The approach Sidney takes to his work in safety engineering is rooted in the fact that airplanes and how they operate are complex systems in nature.

Complex Systems are not systems that are complicated; they have very specific traits that distinguish them from Simple Systems. Complex Systems have the following traits:

It’s difficult to determine their boundaries
It’s difficult to model behavior
They are known to produce emergent phenomena
Their relationships are nonlinear
Their relationships contain feedback loops
They are prone to cascading failures

From Wikipedia:

“Complex systems are systems whose behavior is intrinsically difficult to model due to the dependencies, competitions, relationships, or other types of interactions between their parts or between a given system and its environment. Systems that are "complex" have distinct properties that arise from these relationships, such as nonlinearity, emergence, spontaneous order, adaptation, and feedback loops, among others.”

Examples of Complex Systems:

Global Financial Markets
Nation-State Politics
Weather Patterns
Human Body
Bird Patterns
Distributed Computing Systems

The massive number of technology outages and breaches we are facing today is partially a result of a lack of understanding of how complex systems work. Sidney Dekker made great strides in the field of Safety Engineering by acknowledging that he was dealing with a Complex System and that he would need to think differently in order to make meaningful progress.

Furthermore, this research in Safety Engineering and Resilience Engineering concludes that the human is the solution to the problem, not the cause of it. Humans are not the cause of our problems, even in the beginning of the life of a complex system. In fact, no system is secure, reliable, resilient, safe, etc. by default. These attributes are inherently human constructs and require humans to build and maintain them. As was once the case in Safety Engineering, in security we still tend to want to identify a root cause, and that search typically ends with “human-error”.

These words and concepts really don't exist in complex socio-technical systems. There is almost never a root cause for anything. A simple thought exercise easily proves the point when you ask yourself to identify one root cause for why you are successful as a person or why the Egyptian pyramids have stood the test of time. I'm not challenging anyone’s intelligence here, but it's impossible to identify a single root cause for why something was successful. For this reason it's also impossible to identify a singular root cause for why something wasn't successful. Sidney Dekker likes to point out that if you find a root cause, especially one that ends in human-error, you have now identified the beginning of your investigation—not the end.

Lastly, Security Differently is about educating the broader industry about these concepts and the challenges they are presenting to our ability to be effective at system security. It is my belief that until the cybersecurity industry begins to understand these fundamental learnings we will continue to see breaches, outages, and headlines climb exponentially. What strikes me as alarming is that the cybersecurity industry is ignoring decades of lessons learned, proven practices, and research that the fields of nuclear engineering, safety engineering, resilience engineering, medicine and cognitive sciences have collectively learned and agreed upon. My goal is to drive as much awareness as I can to the fact that “More knowledge is obtained by the ear than the tongue” - Benjamin Franklin.

In part 2 of this interview, we’ll discuss ways that you might consider leveraging Chaos Engineering principles to build predictive security into DevSecOps.