Nearly every organization struggles with certificate-related outages. For people who don’t work with PKI every day, managing TLS certificates seems like it should be very straightforward, but even large organizations with strong IT and security practices fall victim to certificate outages regularly.
I have been at Venafi for almost 9 years and during that time I’ve worked with clients from around the world. Before that, my focus was network and systems management and network operations, so I’ve been in the trenches both as a vendor and as a team member trying to keep systems up and operating reliably.
I’ve seen a lot of organizations from all kinds of industries in various stages of maturity when it comes to managing and securing their machine identities. At this point I’m pretty much able to predict the challenges, struggles and pains that an organization is having and going to have based on the maturity level of their machine identity management program.
These are real-world stories that I have personally seen multiple times while working with organizations all over the world.
Joe requested a certificate, so Joe’s email address is listed as the owner of the cert. An email was sent to Joe 30 days before the certificate was set to expire.
- Problem 1 – Joe changed teams and he’s sure that someone on his previous team is now responsible for that old cert.
- Problem 2 – Joe left the company 6 months ago.
- Problem 3 – Joe’s super busy with important stuff. He sees in the email that he’s got 30 days until the cert expires so that means he has 29 whole days until he needs to handle this. You can guess what happens.
Certificate Management via Spreadsheet / Wiki / SharePoint
Susan created a spreadsheet to track certificates. When someone requests a certificate, Susan logs the cert name, requestor and expiration date. Every week Susan generates a report to identify the certs expiring within 30 days. She sends an email to the owner to let them know the cert is going to expire.
- Problem 1 – See “ownership problem” above.
- Problem 2 – Susan goes on vacation, or gets sick, or takes a few days off unexpectedly. Who’s going to update the spreadsheet when she’s out? Are we not going to allow any cert requests while she’s away?
I know what you are thinking right now. “Come on Venafi, of course she gets vacation (or sick leave or whatever). No organization is going to place such an important task on just one person.” OR you might be thinking “Dang! This Venafi guy knows exactly what my life is like. This is what I deal with every day.” If your organization is trying to manage certificates using a static list that is manually maintained, it doesn’t really matter which belief you have. You will fail, and eventually one of those failures will be significant.
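To make the limits of that weekly report concrete, here is a minimal Python sketch of what it amounts to. The column names (`name`, `requestor`, `expires`), the sample rows and the 30-day window are assumptions for illustration, not anyone's actual spreadsheet.

```python
import csv
import io
from datetime import date, timedelta


def load_rows(text):
    """Parse a CSV export of the tracking sheet into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def expiring_soon(rows, today, window_days=30):
    """Return rows whose 'expires' date is within window_days of today
    (already-expired rows are included), soonest first."""
    cutoff = today + timedelta(days=window_days)
    due = [row for row in rows if date.fromisoformat(row["expires"]) <= cutoff]
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return sorted(due, key=lambda row: row["expires"])


# Hypothetical export; the addresses and dates are made up.
SAMPLE = """name,requestor,expires
shop.example.com,joe@example.com,2024-03-10
api.example.com,ana@example.com,2024-09-01
"""

if __name__ == "__main__":
    for row in expiring_soon(load_rows(SAMPLE), date.today()):
        print(f"{row['name']} expires {row['expires']}; notify {row['requestor']}")
```

Notice what the report cannot know: whether the requestor still works here, whether anyone ran it this week, and where the certificate is actually installed. It is only as good as what someone typed into the sheet.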
You know that spreadsheet or SharePoint or wiki that Susan at Company X created to track certs? Or you know some other system or tool that Tony at Company Y uses to track his certs? What if it doesn’t track where the cert is installed? For that matter, how do either Susan or Tony know where any cert is installed?
Tony has a form that he uses for certificate requests. Susan uses tickets. The form and ticket ask the requestor to provide the information on where the certificate will be installed. So now, 30 days before the certificate is going to expire, Susan and Tony both send email notifications to the cert owner. Susan can even open a ticket to let the owner know the cert is going to expire. The ticket and notification tell the owner where the cert is installed, based on the info they provided 2 years ago (1 year ago beginning this Sept. But that’s another story that will complicate Susan’s and Tony’s lives even more). The owner responds and says they need to renew the cert. Tony and Susan both follow the processes for their respective organizations and provide a renewed cert to the owner well before the certificate expires. The countdown begins: 20 days, 10 days, 5, 4, 3, 2, 1. OUTAGE. What the heck happened?
- Problem 1 – The cert was updated on the load balancer, but someone forgot that the cert is also installed on the app server behind the load balancer.
- Problem 2 – The cert was installed on a cluster of web servers. It was updated on 4 of the 5 servers, but somehow we forgot it was installed on the 5th.
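One way to catch both of these before the countdown hits zero is to ask every endpoint what it is actually serving, rather than trusting the form from two years ago. The sketch below uses only Python's standard `ssl` module. It assumes you already have a complete list of endpoint addresses (which is exactly the hard part), and the function names and the 30-day threshold are illustrative.

```python
import socket
import ssl
import time


def served_not_after(address, server_name, port=443, timeout=5.0):
    """Connect to one endpoint and return the notAfter string of the
    certificate it presents. `address` is the individual server (for
    example a node behind the load balancer); `server_name` is the name
    the certificate was issued for. The default context verifies the
    chain, so an already-expired or untrusted certificate raises
    ssl.SSLCertVerificationError instead of returning a date."""
    context = ssl.create_default_context()
    with socket.create_connection((address, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=server_name) as tls:
            return tls.getpeercert()["notAfter"]


def days_left(not_after, now=None):
    """Whole days remaining for a notAfter string such as
    'Jan 11 00:00:00 2024 GMT'."""
    now = time.time() if now is None else now
    return int((ssl.cert_time_to_seconds(not_after) - now) // 86400)


def stragglers(observed, threshold_days=30, now=None):
    """Given {endpoint: notAfter}, return the endpoints still serving a
    certificate with fewer than threshold_days remaining."""
    return sorted(
        endpoint
        for endpoint, not_after in observed.items()
        if days_left(not_after, now) < threshold_days
    )
```

Run it right after a renewal: any endpoint that shows up in `stragglers` is a load balancer, app server or fifth web server that never got the new cert.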
He said / She said
This is not always the blame game. Sometimes, maybe even most of the time, this is a communication or process issue. Here’s what goes wrong:
Company A is a hosting provider of some sort. Their customers need to use certificates to access Company A’s services. In some scenarios the customer is responsible for the cert, and in others Company A might be responsible for cert generation. In either case, if the cert is not managed, monitored and secured properly there will be an outage. And guess what? Even if the customer was responsible for the cert, it will still be Company A’s fault the cert expired because it is their service the customer is using, and the customer is always right.
In some organizations the app team is responsible for the certs their apps are consuming. In other organizations the device owners are responsible for the certs installed on their devices. In some organizations the SecOps team is responsible. In other organizations it’s a mix. Who gets notified? Who must approve this spend? In these mixed responsibility situations, each potential owner thinks things like:
- “I don’t have access to the webserver where that cert is installed so it’s not my job.”
- “There’s no ticket assigned with my name on it so I’m not your man.”
- “My app runs on several systems which are managed by the Ops team so I’m sure they’re going to renew the cert each year.”
There are endless variations on this theme. It’s easy to see how this can become confusing.
Restarting services / daemons / bindings
App owners and Ops teams are busy. Their days are filled with tasks to deploy new things and keep everything else running. Installing certs is not something that they do every day. So, when they get notified that a cert is going to expire soon, they follow the corporate process to get the cert renewed. Once the cert is renewed, they need to install it. They copy the cert and key into the appropriate location and assume all is well.
25 days later there’s an outage with a severity 1 ticket. The app owner or ops team checks the database. Nothing. They check the network. Nothing. They check the VM. Nothing. They check physical. Nothing. They check the app stack. Nothing. They check all the logs. Nothing. (If this happens on a critical system everyone’s blood pressure is ticking up a notch or two by this point).
At this point someone says, “Wait, isn’t this the system where we just renewed the cert?” Turns out someone copied the new cert to the system but didn’t do the final binding and/or restart services. Because these things didn’t happen, the original cert was still in operation when it expired, so they had an outage.
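A simple post-install check would have caught this on day one: compare the certificate file you just copied to disk with the certificate the running service actually presents. Here is a minimal sketch using Python's standard `ssl` and `hashlib` modules. It assumes the file on disk holds a single PEM-encoded leaf certificate (not a full chain), and the function names are mine, purely for illustration.

```python
import hashlib
import ssl


def fingerprint(pem):
    """SHA-256 fingerprint of one PEM-encoded certificate.
    Assumes `pem` contains exactly one certificate, not a chain."""
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


def fetch_served_pem(host, port=443):
    """Fetch the certificate the running service presents right now.
    No chain validation is performed, so this also works for an
    expired or privately issued certificate."""
    return ssl.get_server_certificate((host, port))


def new_cert_is_live(installed_pem, served_pem):
    """True when the service is presenting the certificate that was installed."""
    return fingerprint(installed_pem) == fingerprint(served_pem)


def check(cert_path, host, port=443):
    """Read the installed certificate and compare it with what is served."""
    with open(cert_path, encoding="ascii") as handle:
        installed = handle.read()
    return new_cert_is_live(installed, fetch_served_pem(host, port))
```

If `check` returns False right after an install, the new file is sitting on disk but the service is still holding the old cert in memory, and a bind or restart is still owed.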
For organizations without a strong machine identity management program, these fundamental problems tend to show up regardless of the type of organization, its business model or how it uses machine identities.
If reading about these issues gives you a strong sense of déjà vu, and you’d like to figure out how to solve these problems once and for all, check out our approach. It’s helped many of our customers eliminate certificate-related outages completely.