Working on the client side of Machine Identity Management, I have witnessed the aftermath of expensive certificate-related outages. These incidents convinced me that before an organization can know “Who” is on the network, it must first establish “What” is on the network.
Early in my career, I worked 12-hour shifts in a data center. I remember one night at 10:00 PM when a crowd of people showed up. They were very anxious because they needed to reset the master password on the HP Atalla HSM that protected debit or credit card transaction flows. That type of outage means real costs in lost revenue.
Later on, one of the teams I ran would regularly receive hundreds of incident tickets and implement dozens of change controls every week. P1 tickets were the most severe: “The server is down. What happened? It’s a potential $500,000 per hour loss or a potential 500+ user impact.” These tickets required immediate investigation to restore service.
So I’ve heard my fair share of outage horror stories, and these are the types of events that you’ll want to do everything in your power to avoid. Let’s say you’re an AIX administrator who accidentally zeroed out the /etc/passwd file on 500 servers. You want to talk outage, right? The responding team would literally have to recall tape media and very rapidly become fluent in full Unix system restores, all while the clock is ticking on untold millions of dollars of downtime. Despite your best intentions to automate a routine process, you may end up losing your job over a situation like this.
There’s nothing like the panic you feel when an incident has passed through several support teams, unresolved. Here’s a particularly bad situation: nobody can log in to your UNIX/Linux servers, which has a cascading, disruptive effect. Cases like this often take hours of investigation to establish what went wrong. After arduous detective work, you trace the problem back to people’s passwords in LDAP. The LDAP infrastructure is highly available, and because it hosts sensitive encrypted passwords, replication between its servers has to use strong encryption, which depends on a certificate. As soon as that certificate expires, LDAP authentication requests are denied and nobody can log in to UNIX/Linux. Only when the certificate is replaced with a renewed version does everything start working again.
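An expiry like the one in this story can be caught ahead of time with a simple check. The following is a minimal sketch, not a description of any particular product: it assumes the LDAP servers expose TLS on port 636, that their certificates chain to a CA the checking machine already trusts, and that a 30-day warning window is acceptable. The function names and the threshold are illustrative assumptions.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days left before a certificate's notAfter timestamp.

    not_after uses the format returned by ssl's getpeercert(),
    for example "Jan 11 00:00:00 2030 GMT". A negative result
    means the certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def classify(days_left: float, warn_days: int = 30) -> str:
    """Turn a day count into a status; warn_days is an assumed threshold."""
    if days_left < 0:
        return "EXPIRED"
    if days_left <= warn_days:
        return "RENEW SOON"
    return "OK"


def fetch_not_after(host: str, port: int = 636, timeout: float = 5.0) -> str:
    """Read notAfter from the certificate a server presents over TLS.

    Assumes the certificate validates against the local trust store.
    With these default settings the handshake fails for an untrusted
    or already-expired certificate, so an exception here is itself
    a signal worth alerting on.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `fetch_not_after("ldap.example.internal")` (a hypothetical hostname), pass the result and `datetime.now(timezone.utc)` to `days_until_expiry`, and open a ticket whenever `classify` returns anything other than "OK".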
The bottom line is that no one really wants to be called in to pinch hit for a particularly challenging outage. However, despite their best efforts, your operations teams will occasionally get an incident they can't resolve on their own. So it bubbles its way up from the level one support people to level two and then to level three. Eventually it hits your desk, and you have to divert your staff to an emergency fix.
Over the years I’ve been exposed to a fair share of pain from certificate-related outages on both sides of the client/vendor line. Certificates may seem like such a routine part of our security regimen that it’s easy to underplay their significance. I’ve learned that you only have to lose control of one certificate for the entire organization to feel the pain, and the average enterprise uses hundreds of thousands, if not millions, of certificate instances. That’s one of the reasons why I’m here at Venafi: to help people avoid the pain of certificate-related outages.