Everyone knows that certificate outages are painful. Just ask anyone who has had to deal with the tangled aftermath of an expired certificate. There are so many unknowns. And so many unanticipated consequences. That’s perhaps why, when it comes to measuring just how bad a given outage was, the details often get blurred by the post-traumatic stress. It’s hard to get answers that quantify the impact. How long was the outage? Too long. How many systems were impacted? Too many. How much revenue was lost? Too much. But that particular brand of denial won’t help anyone prevent a similar outage from happening again.
That’s why it’s so amazing that Epic Games was entirely transparent about a certificate outage that impacted the company on April 6. In the spirit of openness and goodwill, the company shared their outage story with the world. In their own words, “It is embarrassing when a certificate expires, but we felt it was important to share our story here in hopes that others can also take our learnings and improve their systems.”
The company goes on to reveal in-depth details about why the outage happened, how big the impact was, and how long it took to fix. This is incredibly valuable information that can help organizations everywhere understand why they need to take certificate management seriously. This level of sharing is downright…well…epic! And I applaud Epic Games for this heroic level of candor and altruism.
It’s bad enough when one system goes down. But what you will see in the story that Epic Games shares is that certificate outages often have unanticipated, critical impact on systems beyond those directly involved in the original failure. Epic Games outlines the initial outage triggered by the expired certificate, plus two additional areas of substantial impact:
- An expired certificate caused an outage across a large portion of internal back-end service-to-service calls and internal management tools
- Unexpected, significant increases in traffic to the Epic Games Launcher disrupted service for the Launcher and its content distribution features
- An incorrect version of the Epic Games Store website referencing invalid artifacts and assets was deployed as part of automatic scaling, degrading the Epic Games Store experience
It’s hard to imagine a more careful, complete summary of the impacts of a certificate outage. Many companies choose to overlook the peripheral impacts. In this case, over 25 critical staff members were pulled away from other pressing duties to repair the damage. Millions of connections were disrupted. And thousands of frustrated customers (the exact number was not quantified) were served invalid content from the company’s online store. This brings concrete meaning to otherwise vague terms like lost revenue, diverted productivity, customer dissatisfaction and brand damage.
But the relatively mild user irritation caused by the initial outage did not dissipate once the expired certificate was replaced. As I suspect is often the case, the impact lasted much longer than anyone could have predicted. While the expired certificate was detected and replaced in near record time (approximately 37 minutes), the aftermath lingered for nearly 5 hours afterwards. Here’s the exact timeline that Epic Games shared:
- 12:00PM UTC - Internal certificate expired
- 12:06PM UTC - Incident reported and incident management started
- 12:15PM UTC - First customer messaging prepared
- 12:21PM UTC - Confirmation of multiple large service failures by multiple teams
- 12:25PM UTC - Confirmation that the certificate reissue process has started
- 12:37PM UTC - Certificate is confirmed to be reissued
- 12:46PM UTC - Confirmed recovery of some services
- 12:54PM UTC - Connection Tracking discovered as an issue for Epic Games Launcher service
- 1:41PM UTC - Epic Games Launcher service nodes restarted
- 3:05PM UTC - Connection Tracking limits increased for Epic Games Launcher service
- 3:12PM UTC - First signs of recovery of Epic Games Launcher service
- 3:34PM UTC - Epic Games Store web service scales up
- 3:59PM UTC - First reports of missing assets on Epic Games Store
- 4:57PM UTC - Issue with mismatched versions of Epic Games Store web service discovered
- 5:22PM UTC - Epic Games Store web service version corrected
- 5:35PM UTC - Full recovery
Now that is an afternoon that I would not wish on anyone. But congratulations on a successful resolution. So, how can you be sure that this won’t happen to your organization? First, as Epic Games now does, you need to recognize the critical importance of each and every digital certificate that acts as a machine identity anywhere in your network. You need to know how many you have, where they are being used, and…yes…when they will expire. Once you are armed with that information, you can safely automate the entire certificate lifecycle so that there will be no nasty surprises.
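To give a feel for what even the first step of that visibility looks like in practice, here is a minimal, hypothetical sketch in Python (standard library only) that checks how many days remain on the certificates presented by a handful of endpoints. The host names are placeholders, and this is not Epic Games’ tooling or the Venafi platform; a real inventory would discover certificates automatically rather than rely on a hand-maintained list.

```python
# Minimal sketch: report days remaining on the TLS certificates presented by a
# list of endpoints. Host names below are placeholders, not real systems.
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical endpoints to check; a real inventory would be discovered, not hard-coded.
ENDPOINTS = [("example.com", 443), ("internal-api.example.net", 443)]

def days_until_expiry(host: str, port: int) -> int:
    """Fetch the server certificate and return the number of days until notAfter."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like 'Apr  6 12:00:00 2021 GMT'
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host, port in ENDPOINTS:
        try:
            remaining = days_until_expiry(host, port)
            flag = "WARN" if remaining < 30 else "ok"
            print(f"{flag:4} {host}:{port} expires in {remaining} days")
        except Exception as exc:
            # Unreachable host, handshake failure, or an already-expired certificate
            # (the default context rejects it), so this only works as a pre-expiry check.
            print(f"FAIL {host}:{port} {exc}")
```

Even a script this simple makes the point: if nobody knows the full list of certificates to check, nobody gets the warning. That discovery and lifecycle problem is exactly what automated machine identity management is meant to solve.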
Venafi offers a comprehensive platform for machine identity management that has helped the world’s leading companies keep track of their certificates and avoid outages. In fact, based on the lessons we’ve learned from working with 400+ global customers, we’ve created a proven, 8-step methodology that combines people, process and technology. If you follow this blueprint, we guarantee that you can stop TLS certificate-related outages forever.
Tired of worrying when your next certificate outage will hit? Contact us.