On the Venafi blog, we frequently share stories of organizations that are negatively affected when a TLS certificate expires and causes an outage. Google Voice and Epic Games are two examples. In these cases, and many others, the story is often similar: a certificate expires and makes the system or application it’s running on unavailable, or a wildcard certificate expires and takes down more systems or applications than anyone had accounted for.
We’ve also talked a lot in this blog about how to eliminate outages. We’re fortunate to work with many organizations that have solved the challenge of certificate-based outages. Over the years, we’ve looked carefully at what these successful organizations do, and we see very consistent patterns that set them apart from the organizations still struggling with outages. We’ve documented what we’ve seen in our VIA Venafi: Eight Steps to Eliminating Certificate Based Outages methodology.
The first step in that methodology is to “Establish an Outage Safety Net”. When an organization is experiencing regular outages, the most important thing is to first stop the bleeding. That’s where an Outage Safety Net comes into play. When a certificate is about to expire, a well-designed Outage Safety Net alerts critical parts of the organization about the impending outage rather than trying to track down the individual owner of the certificate, an approach that is often ineffective and time-consuming.
The process for finding any certificates that have fallen through the cracks and are likely to expire and create an outage will vary from organization to organization. For some organizations, it might be the last step after multiple escalations of certificate expiration alerts. For others, it might be when there is a severity 1 ticket created. The key point is that the Outage Safety Net is not the standard notification process for certificate renewals. Rather, it is the exception to catch certificates that are not addressed in the normal process and replace them before they expire.
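To make the distinction between the standard notification process and the safety net concrete, here is a minimal Python sketch of how an alert might be routed as a certificate approaches expiry. Everything in it is an illustrative assumption rather than part of the VIA Venafi methodology: the `Route` names, the `Policy` class, and the 30/14/3-day thresholds are hypothetical, and each organization would choose its own triggers.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """Hypothetical destinations for a certificate expiration alert."""
    NONE = "no action needed yet"
    OWNER = "standard renewal notice to the certificate owner"
    ESCALATION = "escalation beyond the certificate owner"
    SAFETY_NET = "organization-wide Outage Safety Net alert"


@dataclass(frozen=True)
class Policy:
    """Assumed thresholds, in days before expiry. Not taken from the source."""
    owner_notice_days: int = 30
    escalation_days: int = 14
    safety_net_days: int = 3


def route_alert(days_remaining: int, renewed: bool,
                policy: Policy = Policy()) -> Route:
    """Pick where an alert should go for one certificate.

    The safety net is the exception path: it only fires when the
    normal notices and escalations have not led to a renewal.
    """
    if renewed:
        return Route.NONE
    if days_remaining <= policy.safety_net_days:
        return Route.SAFETY_NET
    if days_remaining <= policy.escalation_days:
        return Route.ESCALATION
    if days_remaining <= policy.owner_notice_days:
        return Route.OWNER
    return Route.NONE
```

In this sketch, a certificate that is renewed at any earlier stage never reaches the safety net, which matches the idea that the safety net catches only what the normal process missed.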
Why start with an Outage Safety Net?
You might be asking yourself, “Why start with a safety net when the real problem is to deal with certificate renewals?” We know from experience that permanently eliminating outages is a long-term undertaking because you're transforming behavior across the entire organization. So, the importance of this first step is to deliver the immediate benefit of stopping the largest number of outages in the shortest amount of time. This also gives you the ability to focus on the real problem of dealing with certificate renewals without being distracted by constant outages.
Let me describe it in a little more detail. The Outage Safety Net is a way to raise attention across the whole organization when an outage is about to happen. There is nothing new about this concept to any large organization. Several years back, I worked for a company that focused on vulnerability management. All the organizations this company worked with had a mechanism in place that allowed them to highlight enterprise-wide issues. If a vulnerability was discovered in, for example, the Apache software, they had a mechanism to share that information across their organization so that anyone who touched Apache became aware and could react to it. This mechanism also established which internal contacts had the authority and know-how to act when such an event was triggered and the primary system owner was not available. An Outage Safety Net is nothing more than applying that same mechanism to outages.
One advantage of implementing an Outage Safety Net for certificates versus vulnerabilities is that certificate outages are predictable: you know when they are coming. When you go to sleep on a Tuesday night, there’s no way of knowing which, if any, vulnerabilities will be identified on Wednesday that will require your immediate attention. With certificates, though, you can determine exactly which day they will expire. When you detect that a certificate is nearing its expiry date, you can inject that information into the alerting mechanism that likely already exists. As a result, you can enable the whole enterprise to mobilize to prevent that outage.
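Because expiry dates are known in advance, the detection step can be very simple. The following Python sketch, using only the standard library, picks out certificates that fall inside an alert window and formats a message that could be handed to an existing alerting channel. The `CertRecord` type, the function names, and the 7-day default window are assumptions made for illustration; how the inventory is gathered and how the message is delivered depend on your own tooling.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, NamedTuple


class CertRecord(NamedTuple):
    """Hypothetical inventory entry: where the cert lives and when it expires."""
    name: str
    not_after: datetime  # timezone-aware expiry time (UTC)


def expiring_soon(inventory: Iterable[CertRecord], now: datetime,
                  window_days: int = 7) -> List[CertRecord]:
    """Return certificates expiring within the window, soonest first.

    Certificates that have already expired are included as well,
    since they are the most urgent.
    """
    cutoff = now + timedelta(days=window_days)
    due = [cert for cert in inventory if cert.not_after <= cutoff]
    return sorted(due, key=lambda cert: cert.not_after)


def alert_message(cert: CertRecord, now: datetime) -> str:
    """Build a one-line alert suitable for an existing alerting channel."""
    days_left = (cert.not_after - now).days
    return (f"{cert.name}: certificate expires on "
            f"{cert.not_after:%Y-%m-%d} (days remaining: {days_left})")


if __name__ == "__main__":
    utc = timezone.utc
    now = datetime(2024, 1, 10, tzinfo=utc)
    inventory = [
        CertRecord("a.example", datetime(2024, 1, 12, tzinfo=utc)),
        CertRecord("b.example", datetime(2024, 3, 1, tzinfo=utc)),
        CertRecord("c.example", datetime(2024, 1, 11, tzinfo=utc)),
    ]
    for cert in expiring_soon(inventory, now):
        print(alert_message(cert, now))
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the check easy to test and lets you ask "what would expire next week?" by supplying a future date.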
The Outage Safety Net is the first step in the VIA Venafi 8-step methodology for stopping outages because it's the fire alarm. With a good fire department and a mechanism to respond to fires, you buy time to work on future fire prevention: the seven remaining steps, which eliminate outages in a more widespread and sustainable way.
Learn more about VIA Venafi and why we are so certain that Venafi customers who follow the Venafi Way will experience no certificate-based outages.
(This post has been updated. It was originally posted on November 17, 2021.)