This time, our DevSecOps conversation is with Nicolas Chaillan, US Air Force Chief Software Officer and co-lead for the Department of Defense (DoD) Enterprise DevSecOps initiative. Nicolas has generously shared the US DoD’s DevSecOps Reference Design.
IMO the DevSecOps community has been slow to share reference architectures for its own benefit. What was it that made you share yours and did you have to jump through a lot of hoops in order to make it available in the public domain?
Nicolas: Coming from the commercial side, I have always thought it would be very important to bring everyone with us and lead by example so being able to share was really critical to me. We made it very clear from the start that everything we're trying to create, in terms of best practices and documentation, would be publicly shared. I think it's easier now than it used to be in DoD to do that. The culture is shifting but certainly we're one of the first teams to really push for this kind of content to be completely public. It’s the same for the containers we're creating—they're all open source. That's unique in the DoD and writing open source code is pretty new as well.
For readers: Access the DoD open source content here.
What feedback have you had from the community on your reference design?
Nicolas: First, we shared all of the reference design materials across the entire DoD, so we had quite a bit of feedback there. And then we also had about 400,000 views within a week on LinkedIn, which was pretty amazing. There were about seventy revisions before it was published which required a lot of collaboration between teams and all the people involved. The main draft was complete within a couple of months, but it then took about nine months in total to go through the process to publication. It’s our intention to update it around every six months.
Why, in your opinion, is security often an afterthought?
Nicolas: People have different priorities, but when you don't bake security in, the number one issue is that it's very difficult to keep up with the pace of the changing requirements. If you have security as an afterthought, it just becomes this massive bottleneck. Being able to have it baked in from the start is critical as otherwise it's tough to catch up. If you scan your code multiple times a day and you're looking at the quality of the code continuously you can fix small changes slowly but surely every day. Incremental change is the critical piece. If you do it multiple times a day, it's much easier to fix than waiting a year and trying to tackle a huge mound of what is effectively technical debt.
How does your reference architecture look to address vendor lock-in?
Nicolas: There were three components that were critical to that decision—and it was a top priority of the department to make sure we were not getting locked into cloud providers or platform providers. The first aspect was to pick Kubernetes and make sure that whatever product ended up being picked for the platform was CNCF compliant. We have Red Hat’s OpenShift, Pivotal PKS, VMware Tanzu and Rancher. There are so many products that can do that now and that's why we picked it; so that we would not be locked into a single company.
Kubernetes offers an abstraction layer to cloud, networking, storage and compute and it works across multiple environments whether it's a disconnected air-gapped environment, whether it's on premise or cloud. So it's perfect for us. Using Kubernetes for orchestrating our stack is the number one part of the foundational layer. In a way we call it a product but it's more like a standard that will guarantee that the system behaves the same, but you can still pick different products.
The second aspect is containers, which had to be compliant with the Open Container Initiative (OCI) so it's all agnostic to the product and we don't get locked into companies like Docker.
And then the third aspect was really the sidecar container security stack that brings all the behavioral detection, the continuous scanning and continuous monitoring and zero trust down to the container level. The sidecar concept is part of the OCI and Kubernetes CNCF compliance; by combining that and making sure that all the tools and the entire stack are containerized, that gives us abstraction, self-healing and scaling.
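For readers: the sidecar pattern described above can be illustrated with a minimal sketch. It builds a Kubernetes Pod manifest in which a security container runs next to the application container in the same Pod. The function name, the container names and both image references are hypothetical placeholders, not part of the DoD reference design.

```python
import json


def pod_with_security_sidecar(name: str, app_image: str, sidecar_image: str) -> dict:
    """Build a Pod manifest with an application container and a security sidecar.

    Containers in one Pod share its network namespace, so a sidecar can
    observe or proxy the application's traffic without changes to the app.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "containers": [
                {"name": "app", "image": app_image},
                # Hypothetical security sidecar; the image is a placeholder.
                {"name": "security-sidecar", "image": sidecar_image},
            ]
        },
    }


# Illustrative values only; registry.example.com is not a real registry.
manifest = pod_with_security_sidecar(
    "payments",
    "registry.example.com/payments:1.0",
    "registry.example.com/security-sidecar:1.0",
)
print(json.dumps(manifest, indent=2))
```

In practice a sidecar is usually injected automatically by the platform rather than written into each manifest by hand, which is what keeps the developer out of the loop.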
Your reference architecture discusses zero trust environments and the need for mutual TLS (mTLS) down to the Pod. What kind of security features are automatically injected into the application without developer intervention via a sidecar container?
Nicolas: We use Istio, which is also open source and we use service mesh. That also abstracts us from a single product and brings us that baked-in security—the key aspect of zero trust down to the container level. We get a mutual TLS tunnel by using it. We can whitelist two containers to talk to each other which gives us denial by default and the ability to do that tunnel with strong identities with certificates. So we know who is who and what is talking to what.
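For readers: as a rough sketch of what "denial by default" plus an explicit allowance can look like in Istio, the snippet below generates three policy objects: a PeerAuthentication that requires mutual TLS, an AuthorizationPolicy with an empty spec (which allows nothing), and an AuthorizationPolicy that lets one workload identity call one target. The namespace, app label and service account names are assumptions made up for illustration, and this is not the DoD's actual configuration.

```python
import json


def strict_mtls(namespace: str) -> dict:
    """Require mutual TLS for all workloads in the namespace."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


def deny_all(namespace: str) -> dict:
    """An ALLOW policy with no rules matches nothing, so every request is denied."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": "allow-nothing", "namespace": namespace},
        "spec": {},
    }


def allow_caller(namespace: str, target_app: str, caller_sa: str) -> dict:
    """Allow only the caller's certificate-backed identity to reach the target."""
    principal = f"cluster.local/ns/{namespace}/sa/{caller_sa}"
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": f"allow-{caller_sa}-to-{target_app}", "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app": target_app}},
            "action": "ALLOW",
            "rules": [{"from": [{"source": {"principals": [principal]}}]}],
        },
    }


# Hypothetical names: namespace "demo", target app "orders", caller "frontend".
policies = [strict_mtls("demo"), deny_all("demo"), allow_caller("demo", "orders", "frontend")]
bundle = {"apiVersion": "v1", "kind": "List", "items": policies}
print(json.dumps(bundle, indent=2))
```

The principal string is derived from the caller's service account, which is how the mesh ties "who is who" to a certificate rather than to an IP address.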
Is that the guidance that you would give to people who believe that just hardening the outside is sufficient?
Nicolas: I would say, if you want to follow best practice today, you really have to get down to zero trust on the container or even function. We use Knative, which is also open source, for Function as a Service (FaaS) or serverless. We don't use things like Lambda, which would get us locked into a single cloud provider like Amazon. It’s really important for us to have that baked-in security and the abstraction layer is critical.
How are you addressing the need for crypto- and cloud-agility for non-person entities (NPE)/machine identities (e.g. SSL/TLS certificates)?
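For readers: to show why a Knative function stays portable across clusters, here is a minimal sketch that generates a Knative Service manifest. It only needs a container image and can be applied to any Kubernetes cluster with Knative Serving installed. The service name, namespace and image are hypothetical.

```python
import json


def knative_service(name: str, image: str, namespace: str = "default") -> dict:
    """Build a minimal Knative Serving Service manifest for a containerized function."""
    return {
        "apiVersion": "serving.knative.dev/v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "template": {
                "spec": {
                    # Any OCI-compliant image works here; this one is a placeholder.
                    "containers": [{"image": image}]
                }
            }
        },
    }


service = knative_service("resize-image", "registry.example.com/resize-image:1.0", "demo")
print(json.dumps(service, indent=2))
```

Nothing in the manifest refers to a specific cloud provider, which is the point being made about avoiding lock-in.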
Nicolas: That's also something that Istio does for us by automatically generating TLS certificates per container. It manages all the traffic between containers and does the zero trust and authentication aspect as well.
What process do you go through in order to identify bottlenecks and prioritize automation opportunities?
Nicolas: We start by detecting where human bottlenecks exist and aim to automate these aspects. We aim to automate the instantiation of the stack to avoid environment drift. We have the full mapping of the process so we know which stages may have a manual review because it's a nuclear system or it's a weapon system that needs to have some manual checks. It’s fine to have some manual interventions; we're not going to automate everything. The key is to go from a hundred percent manual to, say, 95% automated and 5% manual so that the human can actually focus on the really critical stuff instead of the boring stuff.
We use a kind of value stream mapping; it's not fancy. We map the process to visually represent all the steps and stages, including the manual bottlenecks, and keep track of the average time. We do it for everything from hiring to building a weapon system; it's not just the technical stuff, it's really every aspect of the chain.
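For readers: a simple value stream map of the kind described can be sketched as a list of stages with an average duration and a manual or automated flag. The stage names and hours below are invented for illustration and are not DoD figures.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    avg_hours: float
    manual: bool


def summarize(stages: list[Stage]) -> dict:
    """Report total lead time, time spent in manual stages, and the slowest stage."""
    total = sum(s.avg_hours for s in stages)
    manual_hours = sum(s.avg_hours for s in stages if s.manual)
    bottleneck = max(stages, key=lambda s: s.avg_hours)
    automated_share = sum(1 for s in stages if not s.manual) / len(stages)
    return {
        "total_hours": total,
        "manual_hours": manual_hours,
        "bottleneck": bottleneck.name,
        "automated_share": automated_share,
    }


# Hypothetical pipeline: three automated stages and one manual review.
pipeline = [
    Stage("build", 0.5, manual=False),
    Stage("security scan", 0.5, manual=False),
    Stage("manual security review", 40.0, manual=True),
    Stage("deploy", 1.0, manual=False),
]
summary = summarize(pipeline)
print(summary)
```

Even in this toy example the single manual stage accounts for almost all of the lead time, which is why mapping comes before deciding what to automate.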
I would say we could improve this as a cross-enterprise discipline. The issue of duties was just so big. It's always tough to spread the knowledge and the best practices. That's why we built the website just a few weeks ago. We have training sessions and that's where we also push the guidance and the memos and artifacts so the message spreads. We needed that central place for people to find all that content. We'll do better at sharing, but it's so big, so many people, it's just very difficult. I think scaling now will be the biggest challenge. We have 37 teams moving to DevSecOps, including the largest weapons systems, business systems and cyber offense and defense.
How easy has your organization found it to move from big batch releases to small, incremental updates?
Nicolas: That's probably hardest for weapons systems because they have a tough time cutting the weapon: a weapon is a weapon, it can't just be half of a weapon. They have to prioritize the features they release and maybe use fewer sensors and capabilities. But that's really back to just agile. It's not just a DevSecOps thing. It's about moving to smaller domains and microservices and making the product more modular, and that's probably the biggest culture challenge here. That's very hard and that's why we're doing a lot of training.
What does DevSecOps culture look and feel like?
Nicolas: Right now, it usually takes between five and ten years to deploy a weapon system. If you want to go to small incremental changes and faster delivery when it comes to adding features like AI or machine learning, it's very critical to have fast mean time to production. So it's all about cutting: what's a minimum viable product? What does that even mean for a weapon?
Culturally, do you have conversations about trust and how security people behave with engineers and how engineers behave with security people?
Nicolas: Yes, and that's why we're trying to shift them to the left as well, because we know that the entire process is going to change for these people. It’s a big thing both for testing and full cyber. The cyber people will have to move to development as well because if we're doing infrastructure as code with immutable architectures, they have to change code and not just go into production and run commands. So it's all back to code again.
Will DevSecOps live forever?
Nicolas: Probably not. Nothing lives forever. And it shouldn’t. I think DevSecOps will just keep evolving like agile. For me, if DevOps is an evolution of agile, then DevSecOps is an evolution of DevOps. I think it's continuously changing. If you look at DoD and the way they fund and manage programs these days, we have the operations and maintenance (O&M) phase and then the sustainment phase. Now we're moving to continuous engineering where it's continuously evolving and there's no separation between O&M and sustainment. The funding will become continuous and no longer tied to sustainment; O&M will be research and development and continuously evolving.