The Pain of Certificate-Related Outages Is Very Real [And Completely Avoidable]

July 18, 2019 | John Muirhead-Gould

Working on the client side of Machine Identity Management, I have witnessed the aftermath of expensive certificate-related outages. These incidents convinced me that before an organization can know “Who” is on the network, it must first establish “What” is on the network.

Early in my career, I was working 12-hour shifts in the data center. I remember one night at 10:00 PM when a group of people showed up. They were very anxious because they needed to reset the master password on the HP Atalla HSM that was used to protect debit or credit card transaction flows. That type of outage means real costs in terms of lost revenue.

Later on, one of the teams I ran would regularly receive hundreds of incident tickets and implement dozens of change controls on a weekly basis. P1 tickets were the most severe: “The server is down. What happened? It’s a potential $500,000 per hour loss or a potential 500+ user impact.” These tickets required immediate investigation to restore service.


So I’ve heard my fair share of outage horror stories. Here are the types of events that you’ll want to do everything in your power to avoid. Let’s say you’re an AIX administrator who accidentally zeroed out the /etc/passwd file on 500 servers. You want to talk outage, right? The responding team would literally have to call tape media back and very rapidly become fluent in full Unix system restores, all while the clock is ticking on untold millions of dollars of outage. Despite your best intentions to automate a routine process, you may end up losing your job over a situation like this.

There’s nothing like the panic you feel when an incident has gone all the way through a variety of different support teams, unresolved. Here’s a particularly bad situation: nobody can log in to your UNIX/Linux servers, creating a cascading and disruptive effect. Cases like this often take hours of investigation to finally establish what went wrong. After hours of arduous detective work, you trace the problem back to people’s passwords in LDAP. The LDAP infrastructure is highly available, and because it hosts sensitive encrypted passwords, replication between its servers has to use strong encryption, which relies on a certificate. As soon as that certificate expires, LDAP authentication requests are denied, and nobody can log in to UNIX/Linux. Only when the certificate is replaced with a renewed version does everything start working again.
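A scenario like this can often be caught days in advance by checking how long a certificate has left before it expires. The following is a minimal Python sketch, not a production monitor and not a description of how any Venafi product works. It uses only the standard library; the function names, the 30-day warning window, and the host name `ldap.example.com` are assumptions made up for illustration.

```python
import socket
import ssl
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return days remaining until a certificate's notAfter date.

    `not_after` uses the format returned by ssl.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means the
    certificate has already expired.
    """
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def needs_renewal(not_after: str, warn_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True when the certificate expires within `warn_days` (assumed window)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host: str, port: int = 636, timeout: float = 5.0) -> str:
    """Read the notAfter field from a server's TLS certificate.

    Port 636 (LDAPS) is only a default for this sketch. The default
    context verifies the certificate, so an already expired certificate
    raises ssl.SSLCertVerificationError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Example usage (hypothetical host, requires network access):
#   not_after = fetch_not_after("ldap.example.com")
#   if needs_renewal(not_after):
#       print("Renew soon:", round(days_until_expiry(not_after)), "days left")
```

Run on a schedule against every known endpoint, a check like this turns an expiry into a routine ticket rather than a P1 incident. It only helps for certificates you already know about, which is why inventory comes first.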


The bottom line is that no one really wants to be called in to pinch hit for a particularly challenging outage. However, despite their best efforts, your operations teams will occasionally get an incident they can't resolve on their own. So it bubbles its way up from the level one support people to the level two support people to the level three support people. Eventually it hits your desk, and you have to divert your staff to an emergency fix.

Over the years I’ve been exposed to a fair share of pain from certificate-related outages on both sides of the client/vendor line. Certificates may seem like such a routine part of our security regimen that it’s easy to underplay their significance. I’ve learned that you only have to lose control of one certificate and the entire organization can feel the pain, and the average enterprise uses hundreds of thousands, if not millions, of certificate instances. That’s one of the reasons why I’m here at Venafi: to help people avoid the pain of certificate-related outages.

Do you have visibility of your entire inventory of machine identities?


About the author

John Muirhead-Gould

John is a Strategic Solution Architect with Venafi, whose interests and experience encompass Business Intelligence and Analytics, Cloud Services and Solutions, Digitalization and Digital Marketing, and Cybersecurity and Information Security.
