
Anatomy of a Certificate Outage [Epic Games]

April 21, 2021 | Scott Carter

Everyone knows that certificate outages are painful. Just ask anyone who has had to deal with the tangled aftermath of an expired certificate. There are so many unknowns. And so many unanticipated consequences. And that’s perhaps why, when it comes to measuring the specifics of just how bad a given outage was, the details often get blurred by the post-traumatic stress. So it’s hard to get answers that quantify the impact. How long was the outage? Too long. How many systems were impacted? Too many. How much revenue was lost? Too much. But that particular type of denial won’t help anyone prevent a similar outage from happening again at some point in the future.

That’s why it’s so amazing that Epic Games was entirely transparent about a certificate outage that impacted the company on April 6. In the spirit of openness and goodwill, the company shared their outage story with the world. In their own words, “It is embarrassing when a certificate expires, but we felt it was important to share our story here in hopes that others can also take our learnings and improve their systems.”

The company goes on to reveal in-depth details about why the outage happened, how big the impact was, and how long it took to fix. This is incredibly valuable information to help organizations everywhere understand why they need to take certificate management seriously. This level of sharing is downright…well…epic! And I applaud Epic Games for this heroic level of candor and altruism.

 

It’s bad enough when one system goes down. But what you will see in the story that Epic Games shares is that certificate outages often have unanticipated, critical impact on systems beyond those directly involved in the original outage. Epic Games outlines three areas of impact: the initial outage triggered by the expired certificate, followed by two additional areas of substantial impact:

  1. An expired certificate caused an outage across a large portion of internal back-end service-to-service calls and internal management tools
  2. Unexpected, significant increases of traffic to the Epic Games Launcher disrupted service for the Epic Games Launcher and content distribution features
  3. An incorrect version of the Epic Games Store website referencing invalid artifacts and assets was deployed as part of automatic scaling, degrading the Epic Games Store experience

It’s hard to imagine a more careful, complete summary of the impacts of a certificate outage. Many companies choose to overlook the peripheral impacts. In this case, over 25 critical staff members were pulled away from other pressing duties to repair the damage. Millions of connections were disrupted. And thousands (not quantified) of frustrated customers were offered invalid content from the company’s online store. This brings concrete meaning to otherwise vague terms like lost revenue, diverted productivity, customer dissatisfaction and brand damage.

But the relatively mild user irritation caused by a few minutes of outage did not dissipate once the expired certificate was replaced. As I suspect is often the case, the impact lasted much longer than anyone could have predicted. While the expired certificate was detected and replaced in near-record time (approximately 37 minutes), the aftermath lingered on for nearly 5 hours afterwards. Here’s the exact timeline that Epic Games shared:

  • 12:00PM UTC - Internal certificate expired
  • 12:06PM UTC - Incident reported and incident management started
  • 12:15PM UTC - First customer messaging prepared
  • 12:21PM UTC - Confirmation of multiple large service failures by multiple teams
  • 12:25PM UTC - Confirmation that the certificate reissue process has started
  • 12:37PM UTC - Certificate is confirmed to be reissued
  • 12:46PM UTC - Confirmed recovery of some services
  • 12:54PM UTC - Connection Tracking discovered as an issue for Epic Games Launcher service
  • 1:41PM UTC - Epic Games Launcher service nodes restarted
  • 3:05PM UTC - Connection Tracking limits increased for Epic Games Launcher service
  • 3:12PM UTC - First signs of recovery of Epic Games Launcher service
  • 3:34PM UTC - Epic Games Store web service scales up
  • 3:59PM UTC - First reports of missing assets on Epic Games Store
  • 4:57PM UTC - Issue with mismatched versions of Epic Games Store web service discovered
  • 5:22PM UTC - Epic Games Store web service version corrected
  • 5:35PM UTC - Full recovery
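
To put numbers on that timeline, here is a minimal Python sketch that computes the gaps between a few of the milestones. The timestamps are copied from the list above; the dictionary keys and the helper names (`to_minutes`, `minutes_between`) are my own labels for illustration and do not come from the Epic Games report.

```python
# Selected milestones from the timeline above (all UTC, same day).
# The key names are illustrative labels, not terms from the Epic Games report.
TIMELINE = {
    "certificate_expired": "12:00PM",
    "incident_reported": "12:06PM",
    "certificate_reissued": "12:37PM",
    "full_recovery": "5:35PM",
}


def to_minutes(stamp):
    """Convert a 12-hour stamp such as '5:35PM' to minutes since midnight."""
    clock, meridiem = stamp[:-2], stamp[-2:].upper()
    hour, minute = (int(part) for part in clock.split(":"))
    # 12AM maps to hour 0 and 12PM maps to hour 12.
    hour = hour % 12 + (12 if meridiem == "PM" else 0)
    return hour * 60 + minute


def minutes_between(start_key, end_key, timeline=TIMELINE):
    """Elapsed minutes between two milestones on the same day."""
    return to_minutes(timeline[end_key]) - to_minutes(timeline[start_key])


if __name__ == "__main__":
    for start, end in [
        ("certificate_expired", "incident_reported"),
        ("certificate_expired", "certificate_reissued"),
        ("certificate_reissued", "full_recovery"),
        ("certificate_expired", "full_recovery"),
    ]:
        total = minutes_between(start, end)
        hours, minutes = divmod(total, 60)
        print(f"{start} -> {end}: {hours}h {minutes:02d}m")
```

Run as a script, this shows the certificate being reissued 37 minutes after it expired, with full recovery arriving 4 hours and 58 minutes after that.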

Now that is an afternoon that I would not wish on anyone. But congratulations on a successful resolution. So, how can you be sure that this won’t happen to your organization? First, as Epic Games now does, you need to recognize the critical importance of each and every digital certificate that acts as a machine identity anywhere in your network. You need to know how many you have, where they are being used, and…yes…when they will expire. Once you are armed with that information, you can safely automate the entire certificate lifecycle so that there will be no nasty surprises.
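
As a rough illustration of the "when will it expire" part, the Python sketch below reads the expiry date of the certificate served by a TLS endpoint and flags it when it falls inside a renewal window. It uses only the standard library. The 30-day window, the function names and the example host are assumptions I chose for illustration; this is not how Epic Games or any particular product does it, and a real inventory would also have to cover internal certificates that are not reachable this way.

```python
import socket
import ssl
import time

SECONDS_PER_DAY = 86400
RENEWAL_WINDOW_DAYS = 30  # assumed threshold; pick one that fits your process


def days_until_expiry(not_after, now=None):
    """Days remaining until a 'notAfter' string such as 'Jan 31 00:00:00 2030 GMT'."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def needs_renewal(not_after, now=None, window_days=RENEWAL_WINDOW_DAYS):
    """True when the certificate expires within the renewal window."""
    return days_until_expiry(not_after, now) <= window_days


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the 'notAfter' field of the certificate presented by host:port.

    The default context verifies the chain, so an already expired or
    untrusted certificate raises ssl.SSLCertVerificationError instead of
    returning a date. That failure is a useful signal in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


if __name__ == "__main__":
    host = "example.com"  # placeholder host for illustration
    not_after = fetch_not_after(host)
    remaining = days_until_expiry(not_after)
    status = "RENEW SOON" if needs_renewal(not_after) else "ok"
    print(f"{host}: expires {not_after} ({remaining:.0f} days left) {status}")
```

A check like this only covers endpoints you already know about, which is why the discovery step (how many certificates you have and where they are used) has to come first.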

Venafi offers a comprehensive platform for machine identity management that has helped the world’s leading companies keep track of their certificates and avoid outages. In fact, based on the lessons we’ve learned from working with 400+ global customers, we’ve created a proven, 8-step methodology that combines people, process and technology. If you follow this blueprint, we guarantee that you can stop TLS certificate-related outages forever.

Tired of worrying when your next certificate outage will hit? Contact us.
 

About the author

Scott Carter

Scott is Senior Manager for Content Marketing at Venafi. With over 20 years in cybersecurity marketing, he helps large organizations understand the risks to machine identities and why they should protect them.
