Nearly every organization struggles with certificate-related outages. For people who don’t work with PKI every day, managing TLS certificates seems like it should be very straightforward, but even large organizations with strong IT and security practices fall victim to certificate outages regularly.
I have been at Venafi for almost 9 years, and during that time I’ve worked with clients from around the world. Before that, my focus was network and systems management and network operations, so I’ve been in the trenches both as a vendor and as a team member trying to keep systems up and operating reliably.
I’ve seen a lot of organizations from all kinds of industries in various stages of maturity when it comes to managing and securing their machine identities. At this point I’m pretty much able to predict the challenges, struggles and pains that an organization is having and going to have based on the maturity level of their machine identity management program.
These are real-world stories that I have personally seen multiple times while working with organizations all over the world.
Joe requested a certificate, so Joe’s email address is listed as the owner of the cert. An email was sent to Joe 30 days before the certificate was set to expire.
Certificate Management via Spreadsheet / Wiki / SharePoint
Susan created a spreadsheet to track certificates. When someone requests a certificate, Susan logs the cert name, requestor and expiration date. Every week Susan generates a report to identify the certs expiring within 30 days. She sends an email to the owner to let them know the cert is going to expire.
I know what you are thinking right now. “Come on Venafi, of course she gets vacation (or sick leave or whatever). No organization is going to place such an important task on just one person.” Or you might be thinking, “Dang! This Venafi guy knows exactly what my life is like. This is what I deal with every day.” If your organization is trying to manage certificates using a static list that is manually maintained, it doesn’t really matter which belief you have. You will fail, and eventually one of those failures will be significant.
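As a rough illustration, the weekly report Susan runs against her spreadsheet might look like the Python sketch below. The CSV column names (name, requestor, expires) and the function names are hypothetical and not taken from any real tool. Note that this only automates the report; it still depends on someone keeping the list accurate and complete, which is the underlying weakness.

```python
import csv
import io
from datetime import date, timedelta


def load_records(csv_text):
    """Parse spreadsheet rows exported as CSV.

    Assumes hypothetical columns: name, requestor, expires (YYYY-MM-DD).
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "name": row["name"],
            "requestor": row["requestor"],
            "expires": date.fromisoformat(row["expires"]),
        }
        for row in rows
    ]


def expiring_within(records, today, window_days=30):
    """Return records that expire on or before today + window_days.

    Already-expired certs are included on purpose, soonest expiry first.
    """
    cutoff = today + timedelta(days=window_days)
    due = [r for r in records if r["expires"] <= cutoff]
    return sorted(due, key=lambda r: r["expires"])
```

Calling expiring_within(load_records(text), date.today()) would give the list of owners to email this week. Nothing in it knows about certificates that never made it into the spreadsheet.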
You know that spreadsheet or SharePoint or wiki that Susan at Company X created to track certs? Or you know some other system or tool that Tony at Company Y uses to track his certs? What if it doesn’t track where the cert is installed? For that matter, how do either Susan or Tony know where any cert is installed?
Tony has a form that he uses for certificate requests. Susan uses tickets. The form and ticket ask the requestor to provide the information on where the certificate will be installed. So now, 30 days before the certificate is going to expire, Susan and Tony both send email notifications to the cert owner. Susan can even open a ticket to let the owner know the cert is going to expire. The ticket and notification tell the owner where the cert is installed, based on the info they provided 2 years ago (1 year ago beginning this September, but that’s another story that will complicate Susan’s and Tony’s lives even more). The owner responds and says they need to renew the cert. Tony and Susan both follow the processes for their respective organizations and provide a renewed cert to the owner well before the certificate expires. The countdown begins: 20 days, 10 days, 5, 4, 3, 2, 1. OUTAGE. What the heck happened?
He said / She said
This is not always the blame game. Sometimes, maybe even most of the time, this is a communication or process issue. Here’s what goes wrong:
Company A is a hosting provider of some sort. Their customers need to use certificates to access Company A’s services. In some scenarios the customer is responsible for the cert, and in others Company A might be responsible for cert generation. In either case, if the cert is not managed, monitored and secured properly there will be an outage. And guess what? Even if the customer was responsible for the cert, it will still be Company A’s fault the cert expired because it is their service the customer is using, and the customer is always right.
In some organizations the app team is responsible for the certs their apps are consuming. In other organizations the device owners are responsible for the certs installed on their devices. In some organizations the SecOps team is responsible. In other organizations it’s a mix. Who gets notified? Who must approve this spend? In these mixed responsibility situations, each potential owner thinks things like:
There are endless variations on this theme, and it’s easy to see how this can become confusing.
Restarting services / daemons / bindings
App owners and Ops teams are busy. Their days are filled with tasks to deploy new things and keep everything else running. Installing certs is not something that they do every day. So, when they get notified that a cert is going to expire soon, they follow the corporate process to get the cert renewed. Once the cert is renewed, they need to install it. They copy the cert and key into the appropriate location and assume all is well.
25 days later there’s an outage with a severity 1 ticket. The app owner or ops team checks the database. Nothing. They check the network. Nothing. They check the VM. Nothing. They check physical. Nothing. They check the app stack. Nothing. They check all the logs. Nothing. (If this happens on a critical system, everyone’s blood pressure is ticking up a notch or two by this point.)
At this point someone says, “Wait, isn’t this the system where we just renewed the cert?” Turns out someone copied the new cert to the system but didn’t do the final binding and/or restart services. Because these things didn’t happen, the original cert was still in operation when it expired, so they had an outage.
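One way to catch this before the old cert expires is to compare the certificate the service is actually presenting with the renewed file that was copied onto the box. The Python sketch below is a minimal illustration using only the standard library. The function names are hypothetical, and it assumes the installed file holds a single PEM certificate rather than a full chain.

```python
import hashlib
import ssl


def fingerprint(pem_cert):
    """SHA-256 fingerprint (hex) of a single PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert.strip())
    return hashlib.sha256(der).hexdigest()


def served_cert_pem(host, port=443):
    """Fetch the certificate the live service presents, as PEM text.

    ssl.get_server_certificate does not validate the cert here; it only
    retrieves it, which is all this comparison needs.
    """
    return ssl.get_server_certificate((host, port))


def renewed_cert_is_live(installed_pem, served_pem):
    """True when the service is serving the same cert that is on disk."""
    return fingerprint(installed_pem) == fingerprint(served_pem)
```

After a renewal, reading the new cert file and calling renewed_cert_is_live(file_text, served_cert_pem("app.example.com")) returns False if the service was never rebound or restarted. The hostname here is only a placeholder.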
For organizations without a strong machine identity management program, these fundamental problems tend to show up regardless of the type of organization, their business model and how they use machine identities.
If reading about these issues gives you a strong sense of déjà vu, and you’d like to figure out how to solve these problems once and for all, check out our approach. It’s helped many of our customers eliminate certificate-related outages completely.