Everyone knows that certificate outages are painful. Just ask anyone who has had to deal with the tangled aftermath of an expired certificate. There are so many unknowns. And so many unanticipated consequences. That’s perhaps why, when it comes to measuring just how bad a given outage was, the details often get blurred by the post-traumatic stress. It’s hard to get answers that quantify the impact. How long was the outage? Too long. How many systems were impacted? Too many. How much revenue was lost? Too much. But that particular brand of denial won’t help anyone prevent a similar outage from happening again.
That’s why it’s so amazing that Epic Games was entirely transparent about a certificate outage that impacted the company on April 6. In the spirit of openness and goodwill, the company shared their outage story with the world. In their own words, “It is embarrassing when a certificate expires, but we felt it was important to share our story here in hopes that others can also take our learnings and improve their systems.”
The company goes on to reveal in-depth details about why the outage happened, how big the impact was, and how long it took to fix. This is incredibly valuable information that can help organizations everywhere understand why they need to take certificate management seriously. This level of sharing is downright…well…epic! And I applaud Epic Games for this heroic level of candor and altruism.
It’s bad enough when one system goes down. But what you will see in the story that Epic Games shares is that certificate outages often have unanticipated, critical impact on systems beyond those directly involved in the original failure. Epic Games outlines the initial outage triggered by the expired certificate, plus two additional areas of substantial impact:
- An expired certificate caused an outage across a large portion of internal back-end service-to-service calls and internal management tools
- Unexpected, significant increases in traffic to the Epic Games Launcher disrupted service for the Launcher and its content distribution features
- An incorrect version of the Epic Games Store website referencing invalid artifacts and assets was deployed as part of automatic scaling, degrading the Epic Games Store experience
It’s hard to imagine a more careful, complete summary of the impacts of a certificate outage. Many companies choose to overlook the peripheral impacts. In this case, over 25 critical staff members were pulled away from other pressing duties to repair the damage. Millions of connections were disrupted. And thousands of frustrated customers (the exact number was not quantified) were served invalid content from the company’s online store. This brings concrete meaning to otherwise vague terms like lost revenue, diverted productivity, customer dissatisfaction and brand damage.
But the relatively mild user irritation caused by the initial outage did not dissipate once the expired certificate was replaced. As I suspect is often the case, the impact lasted much longer than anyone could have predicted. While the expired certificate was detected and replaced in near record time (approximately 37 minutes), the aftermath lingered for nearly 5 hours afterwards. Here’s the exact timeline that Epic Games shared:
- 12:00PM UTC - Internal certificate expired
- 12:06PM UTC - Incident reported and incident management started
- 12:15PM UTC - First customer messaging prepared
- 12:21PM UTC - Confirmation of multiple large service failures by multiple teams
- 12:25PM UTC - Confirmation that the certificate reissue process has started
- 12:37PM UTC - Certificate is confirmed to be reissued
- 12:46PM UTC - Confirmed recovery of some services
- 12:54PM UTC - Connection Tracking discovered as an issue for Epic Games Launcher service
- 1:41PM UTC - Epic Games Launcher service nodes restarted
- 3:05PM UTC - Connection Tracking limits increased for Epic Games Launcher service
- 3:12PM UTC - First signs of recovery of Epic Games Launcher service
- 3:34PM UTC - Epic Games Store web service scales up
- 3:59PM UTC - First reports of missing assets on Epic Games Store
- 4:57PM UTC - Issue with mismatched versions of Epic Games Store web service discovered
- 5:22PM UTC - Epic Games Store web service version corrected
- 5:35PM UTC - Full recovery
Now that is an afternoon that I would not wish on anyone. But congratulations on a successful resolution. So, how can you be sure that this won’t happen to your organization? First, as Epic Games now does, you need to recognize the critical importance of each and every digital certificate that acts as a machine identity anywhere in your network. You need to know how many you have, where they are being used, and…yes…when they will expire. Once you are armed with that information, you can safely automate the entire certificate lifecycle so that there will be no nasty surprises.
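To give a feel for what even the first step of that visibility looks like in practice, here is a minimal, hypothetical sketch in Python (standard library only) that checks how many days remain on the certificates presented by a handful of endpoints. The host names are placeholders, and this is not Epic Games’ tooling or the Venafi platform; a real inventory would discover certificates automatically rather than rely on a hand-maintained list.

```python
# Minimal sketch: report days remaining on the TLS certificates presented by a
# list of endpoints. Host names below are placeholders, not real systems.
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical endpoints to check; a real inventory would be discovered, not hard-coded.
ENDPOINTS = [("example.com", 443), ("internal-api.example.net", 443)]

def days_until_expiry(host: str, port: int) -> int:
    """Fetch the server certificate and return the number of days until notAfter."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like 'Apr  6 12:00:00 2021 GMT'
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host, port in ENDPOINTS:
        try:
            remaining = days_until_expiry(host, port)
            flag = "WARN" if remaining < 30 else "ok"
            print(f"{flag:4} {host}:{port} expires in {remaining} days")
        except Exception as exc:
            # Unreachable host, handshake failure, or an already-expired certificate
            # (the default context rejects it), so this only works as a pre-expiry check.
            print(f"FAIL {host}:{port} {exc}")
```

Even a script this simple makes the point: if nobody knows the full list of certificates to check, nobody gets the warning. That discovery and lifecycle problem is exactly what automated machine identity management is meant to solve.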
Venafi offers a comprehensive platform for machine identity management that has helped the world’s leading companies keep track of their certificates and avoid outages. In fact, based on the lessons we’ve learned from working with 400+ global customers, we’ve created a proven, 8-step methodology that combines people, process and technology. If you follow this blueprint, we guarantee that you can stop TLS certificate-related outages forever.
Tired of worrying when your next certificate outage will hit? Contact us.