In part one of my interview with Aaron Rinehart, we discussed the work he is doing as co-founder and CTO at chaos engineering startup Verica. In part two, we talked more about how chaos engineering relates to Rugged DevOps and DevSecOps.
How do you define DevSecOps?
Aaron: I still subscribe to the thinking that the movement was born out of the manifested issues between Dev and Ops working in siloed functions. Security should just be a by-product of good software development and good operations work. After all, do we really need a security team that wants to treat itself as another function? One of the biggest challenges facing information security professionals is that very few of them understand where they fit in the value chain where they work.
The industry over the past few decades has produced a Catch-22 situation in that the creation of the CISO and the Information Security function within organizations was intended to create accountability for security. The problem with these functions is that, by design, they are not in the value chain. How many security professionals can say that they meaningfully contribute direct value back to the products and services that a company provides to its customers?
Many times, all security professionals can do is point out to a software engineer or ops engineer that something appears to be insecure. The majority of the security functions we have created today cannot actually do anything to directly improve or change the state of a system's security without having an engineer do it for them. The security industry loves to say catchy things like “Security is everyone’s responsibility”, but to a software engineer who spends tireless hours trying to invent a new capability that has never been created before, this sounds a bit indolent. If we have a group of highly paid professionals charged with security as their job function that can’t, don’t or even won’t do anything to help create it, there is a lack of balance and alignment in this equation. Cybersecurity has always been an engineering problem.
After all, we didn't have a field called cybersecurity until we had computers. The catchy phrase we should all really be saying is “Everyone is responsible for engineering”. In summary, we need to stop inserting the word security into everything. Just DevOps is fine in my opinion.
What is the relationship between Rugged DevOps, DevSecOps and Chaos Engineering?
Aaron: I have never actually tried to relate the concepts together before. My gut response would be to say that they are all about building highly performant, secure and resilient systems by making the right thing to do the easy thing to do. I have always seen Rugged DevOps and DevSecOps as being the exact same thing. People who try to establish a different meaning for each just confuse me. As I said in my previous response, it's just DevOps; over time it will just be how we normally build things.
There are definitely some tie-ins with Rugged/DevSecOps and Chaos Engineering. It's not easy to directly connect the relationships between them for the same reason folks have trouble finding a clear relationship between DevOps and SRE practices. Chaos Engineering was originally meant to be a toolset for SREs to help them understand the behavior of their systems. Chaos Engineering is the only proactive method that can identify potential incidents or outages before they happen.
Chaos Engineering proactively introduces turbulent conditions into a system to try to determine the conditions under which a system or service will fail before it actually fails. As a technique it is designed to help correct our mental model of how we think the system works versus how it works in reality. Casey Rosenthal, the creator of chaos engineering at Netflix, likes to describe Chaos Engineering as being part of an emerging series of software techniques known as continuous verification. Once an engineering team achieves continuous integration and continuous delivery, the speed, complexity and scale of this new world necessitate an ability to continuously verify that the system is behaving as expected post-deployment.
The reason why Chaos Engineering has become such a popular practice is that our systems almost never behave the way they were supposed to. If I had to establish a relationship between the three practices, I would say they are overlapping, complementary practices that reinforce one another throughout the lifecycle of a complex system.
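To make the loop described above concrete, here is a minimal sketch that is not taken from the interview: check a steady-state hypothesis, inject a failure, check the hypothesis again, then roll back. `ToyService`, `steady_state` and `run_experiment` are hypothetical names invented for illustration; a real experiment would target an actual system and rely on real monitoring rather than an in-memory toy.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a replicated service (not a real system)."""
    replicas: dict = field(default_factory=lambda: {"a": True, "b": True, "c": True})

    def handle_request(self) -> bool:
        # A request succeeds as long as at least one replica is healthy.
        return any(self.replicas.values())

    def kill(self, name: str) -> None:
        self.replicas[name] = False

    def restore(self, name: str) -> None:
        self.replicas[name] = True


def steady_state(service: ToyService, requests: int = 100, threshold: float = 0.99) -> bool:
    """Steady-state hypothesis: the success rate stays at or above the threshold."""
    successes = sum(service.handle_request() for _ in range(requests))
    return successes / requests >= threshold


def run_experiment(service: ToyService, victim: str) -> str:
    """Inject one failure and report whether the steady-state hypothesis held."""
    if not steady_state(service):
        return "aborted: system not in steady state before injection"
    service.kill(victim)  # the turbulent condition
    try:
        held = steady_state(service)
    finally:
        service.restore(victim)  # always roll back to limit the blast radius
    return "hypothesis held" if held else "hypothesis refuted"


print(run_experiment(ToyService(), "a"))
print(run_experiment(ToyService(replicas={"a": True}), "a"))
```

The second call models a service with no redundancy, so the same injected failure refutes the hypothesis. That gap between the expected and the observed result is the mismatch in mental models the technique is meant to surface.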
Are there any additional considerations people need to make when they are chaos engineering in the cloud? Particularly when using containers?
Aaron: I would say that it's critical that you follow the Principles of Chaos articulated by the originators of Chaos Engineering at Netflix.
It's important to understand that Chaos Engineering is not about “breaking things in production”. I’m pretty sure if you go around breaking things in production where you work, you won’t have a job very long. Chaos Engineering is more about fixing things in production. It's very important, when practising Chaos Engineering, to start off experimenting in a lower environment such as development, testing or staging. It's important to ensure you are confident in the toolsets you are using, to confirm your understanding of the blast radius of the experiments, and to validate that your monitoring and observability toolsets are effective. Lower environments are ideal for building your maturity in Chaos Engineering. Furthermore, don't discount the opportunity to learn from lower environments. Yes, they may not be the actual production system where the business-critical outage will occur, but you would be surprised what you can learn about your production system by running chaos experiments on your staging environments.
One of the largest retailers learned this very thing when they decided to run a chaos experiment on Kafka in their staging environment. Their expectation was that if a Kafka broker went down, another one would immediately spin right up and the brokers would rebalance. They went ahead and executed this experiment in their Kafka staging environment, and it turned out that when they brought down a node, it brought production down instead of staging. This taught the engineering team a valuable lesson about how their system was actually configured. They had forgotten to change the pointers and configuration information, and production was still tied to staging. This is the reality of how our systems really work. Our systems will almost never fail the way we expect them to; if they did, we wouldn't have outages because we would have fixed the problem already.
As for cloud-specific recommendations or container-specific recommendations, all the above still applies. There is a wealth of information and tools for cloud-based Chaos Engineering experiments. My ask for people getting into this space and exploring how their systems fail is to do their own thinking and not just run a tool. Most of the open source tools on the market are just recreations of Chaos Monkey in a different language, architecture or tech stack. There is a vast domain of failure and faults you can explore. One of the best places to source experiments is post-mortem or incident data. You can learn a lot about an organization and how its systems fail through its unexpected outcomes, AKA incidents and outages.
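The Kafka story above came down to one environment's configuration still pointing at another. As an illustration only, not something described in the interview, the sketch below walks a nested configuration and flags any string value that mentions a different environment. The function name, the config layout and the `example.internal` hostnames are all hypothetical, and a substring match is a deliberately crude heuristic.

```python
def find_foreign_references(config, forbidden_markers, path=""):
    """Return (key path, value) pairs whose string value contains a forbidden marker."""
    findings = []
    if isinstance(config, dict):
        for key, value in config.items():
            child = f"{path}.{key}" if path else str(key)
            findings.extend(find_foreign_references(value, forbidden_markers, child))
    elif isinstance(config, list):
        for index, value in enumerate(config):
            findings.extend(
                find_foreign_references(value, forbidden_markers, f"{path}[{index}]")
            )
    elif isinstance(config, str):
        lowered = config.lower()
        if any(marker in lowered for marker in forbidden_markers):
            findings.append((path, config))
    return findings


# Hypothetical production config that still references a staging broker.
production_config = {
    "environment": "production",
    "kafka": {
        "bootstrap_servers": [
            "kafka-1.prod.example.internal:9092",
            "kafka-2.staging.example.internal:9092",
        ],
    },
    "orders_api": "https://orders.prod.example.internal",
}

findings = find_foreign_references(production_config, forbidden_markers=(".staging.",))
for key_path, value in findings:
    print(f"{key_path} points outside production: {value}")
```

A check like this could run before an experiment to confirm the assumed blast radius. It only catches references that are visible in configuration, so it complements running the experiment rather than replacing it.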
Why are certificates still a challenge for DevOps?
Aaron: My best guess would be rooted in the way certificates have historically been provisioned and managed. The provisioning process has traditionally followed a change request process that usually involves some sort of service request or ticketing system. The solutions commonly being used for certificate lifecycle management and provisioning are still very monolithic in nature. If teams attempt to circumvent those solutions, for the sake of development velocity, they run the risk of generating insecure certificates that are not centrally managed.
This initially may not prove problematic, but when the app goes live into production and nobody is managing the certs’ lifecycle and rotation, you run the risk of expired certificates causing catastrophic outages. Furthermore, you also potentially run the risk of not being notified or taking action when there are compromises in the certificate chain of trust. So, in summary, the answer probably rests with poor tooling and culture gaps between information security and DevOps teams.
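One small guard against the expired-certificate outage described above is to track expiry dates somewhere and flag anything close to its deadline. The sketch below is an illustration under stated assumptions, not a tool mentioned in the interview: the inventory, its hostnames and the 30-day window are hypothetical. The date strings follow the `notAfter` format that Python's `ssl` module reports for a peer certificate, and `ssl.cert_time_to_seconds` is the standard-library helper for parsing it.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp (negative if expired)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def expiring_soon(inventory: dict, now: datetime, warn_days: int = 30) -> list:
    """Names of certificates that are expired or expire within `warn_days`."""
    return sorted(
        name
        for name, not_after in inventory.items()
        if days_until_expiry(not_after, now) <= warn_days
    )


# Hypothetical inventory: certificate name -> notAfter string.
inventory = {
    "api.example.internal": "Jan 15 00:00:00 2030 GMT",
    "web.example.internal": "Dec 31 00:00:00 2030 GMT",
    "legacy.example.internal": "Dec  1 00:00:00 2029 GMT",
}

now = datetime(2030, 1, 1, tzinfo=timezone.utc)  # fixed date so the example is repeatable
print(expiring_soon(inventory, now))
```

In practice the `notAfter` value could be read from a live endpoint with `ssl.SSLSocket.getpeercert()`, and `now` would be the current time. The harder problem raised above, certificates that nobody centrally manages, is organizational and is not solved by a script.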
You and your colleague James Wickett have spoken at the RSA Conference about The Security, DevOps and Chaos Playbook. What’s this all about?
Aaron: The talk we gave at RSA is a composition of the research James and I have been working on, both individually and together, over the past six years. The MEASURE Framework is made up of the following:
- Unrestrained Sharing
The MEASURE framework combines the work James has been doing since the beginning of DevOps and my work in Safety, Resilience, and Chaos Engineering. Both James and I are software engineers who got into security. I believe that being an engineer first puts those folks at an advantage.
The advantage is that you know how things are really built and what it takes to do that work. Some of the points we highlight are around security being about the making, not just the breaking, and that security must be able to write code. Obviously, some of these elements cause friction with many folks in the industry, as expected, but in a world now dominated by software-defined everything it doesn't seem like that much of a stretch. I mean, if you have never built software before, how would you know what building good software or secure software looks like?
The process of security folks learning how to code helps bridge the empathy gap between software engineering and security. You really get a different perspective when you have to eat your own dog food. Other key points we make are about how people learn through experimentation and failure. If we as an industry discourage experimentation and failure, we discourage learning. I don’t think I’ve ever done anything perfectly the first time I did it. How can we expect that of others?
Who is Verica and what are they about?
Aaron: Chaos Engineering is a practice that came out of Netflix to proactively discover vulnerabilities in large software systems. Verica was founded by the two people behind Chaos Engineering: we defined it, wrote the book on it, managed the pioneering teams and run the conferences.
Verica is bringing that experience to the enterprise market in the form of a Continuous Verification platform that proactively uncovers system weaknesses and security flaws before they disrupt a business. As the next step in the evolution of Chaos Engineering, continuous verification provides a disciplined methodology to prevent availability and security incidents.
Will DevSecOps live forever?
Aaron: I do think DevSecOps will live forever. It is my belief and hope that it lives on forever as the new normal in how we build software.