Certificate outages can have devastating effects on your websites and web applications! Not only does a certificate outage mean interruptions in your ability to encrypt data in transit, it also causes downtime overall. Here are some terrifying stories of TLS certificate outages that have happened to other companies. Similar incidents could also happen to your organization!
Microsoft Teams: as written by Shelby Brown of CNET in February:
“Microsoft Teams, a workplace collaboration app designed to rival Slack, suffered an outage Monday after Microsoft failed to renew its authentication certificate. The tool was down for about three hours while Microsoft investigated and updated the certificate, according to the company's Twitter account. By around 9 a.m. PT Monday, the service was back up for most users though the issue wasn't fully resolved until later in the day. ‘We've addressed an access issue that some customers may have experienced, and service is fully restored,’ a Microsoft spokesperson told CNET in an email Monday.”
Government websites: from Kris Holt via Engadget in early 2019:
“Agency websites are among the many facets of the US government that the ongoing shutdown has affected, as more than 80 TLS certificates on government sites have reportedly expired. Even though federal employees could have renewed them well in advance of the shutdown, there's no one around to do so now, meaning dozens of sites may be inaccessible or non-secure for the time being.
NASA, the Department of Justice and the Court of Appeals are among those whose sites have been affected, and the expired certificates have impacted services including payment portals, according to Netcraft.”
O2: courtesy of yours truly in 2018:
“When about 32 million people in the UK lost the use of 4G and SMS on December 6th, I could definitely feel their pain. That’s a major inconvenience to people in their everyday lives, and also to many businesses which rely on their phones.
The outage affected O2 customers, and also customers of other Telefonica U.K. carriers, which include GiffGaff, Lyca Mobile, Sky Mobile, and Tesco Mobile. The common link is Ericsson’s Serving GPRS Support Node—Mobility Management Entity software. Ericsson was making changes to their Ericsson's Centralized User Database of subscribers. And what was the point of failure? An expired certificate. A singular machine identity. Really!”
These are embarrassing sorts of problems that could happen to any organization, large or small. If and when they do happen, you must fix the problem as soon as possible! Otherwise you could have a lot of unhappy customers on your hands. But once the outage has been fixed, who’s to blame?
Well, let’s look at the Microsoft Teams incident first. Microsoft is easily one of the biggest tech companies in the world, and they’ve also greatly expanded and diversified their web services in the past couple of decades. There’s the Microsoft Azure cloud platform, SharePoint, Hotmail, Office 365, and the list goes on. Microsoft Teams is one of their newest services, launched in 2017. It integrates with Office 365 and allows companies to deploy workplace chat, video meetings, and file sharing. It’s like Slack in many ways, but compatible with the Microsoft ecosystem.
The massive scale of Microsoft’s web services alone was very likely a factor in the Microsoft Teams certificate outage. If you’ve seen how often certificates expire, need revocation, or become lost in a typical web application, imagine how much more frequently that happens in Microsoft’s services! The integration with Office 365 and other Microsoft services is probably another factor. They’ve got some enormous web apps and they’re very much enmeshed with each other.
I would expect that Microsoft automates a lot of their certificate management, but are all of their certificates managed through automation? There may be an entire department at Microsoft responsible for the certificate that caused the outage, and they should definitely improve their certificate management if they haven’t already.
Next, there’s the American government shutdown story from early 2019. One of many unintended consequences of the shutdown was the plethora of public sector services, deployed through the web, that were hurt by the expiration of more than eighty TLS certificates. Certificates expire all the time. They must expire so they can be replaced by new certificates, ensuring that no certificate is used for too long. Unfortunately, those certificates weren’t replaced in time and a lot of government services experienced dreadful downtime.
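Since every certificate carries its own expiry date, it’s possible to check how much time is left before anything breaks. Here’s a minimal sketch in Python, using only the standard library. The function names `days_until_expiry` and `fetch_not_after` are my own inventions for illustration, and a real monitoring setup would need to cover certificates that aren’t reachable over a public TLS port.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by ssl's getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT". A negative result
    means the certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field of the certificate a server presents.

    Note: the default context verifies the certificate, so a server
    whose certificate has already expired raises an SSL error here
    instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

You could call `days_until_expiry(fetch_not_after("example.com"), datetime.now(timezone.utc))` on a schedule and raise an alert when the number drops below whatever threshold your team is comfortable with.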
Who’s to blame here? It’s possible to avoid a certificate outage even if all of your employees stop going to work for a while. The US government should have had systems in place that could generate new certificates to replace old ones as they expire, even without direct human intervention. It’s fully understandable for workers to stop going to work if they aren’t getting paid. So, you may choose to blame the decision makers in the public sector who didn’t plan to automate certificate management properly. Oops!
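To give a rough idea of what “without direct human intervention” can look like, here’s a small Python sketch of the decision an automated system makes on every run: which certificates in the inventory are close enough to expiry that they should be renewed now? The `CertRecord` type, the `due_for_renewal` function, and the 30-day window are all assumptions of mine for illustration. The actual renewal step (requesting and installing a new certificate from your CA) is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CertRecord:
    """One entry in a hypothetical certificate inventory."""
    name: str
    not_after: datetime  # expiry time, timezone-aware


def due_for_renewal(inventory, now, window=timedelta(days=30)):
    """Return certificates that expire within `window` of `now`.

    Already-expired certificates are included too. The result is
    sorted so the most urgent certificate comes first.
    """
    due = [cert for cert in inventory if cert.not_after - now <= window]
    return sorted(due, key=lambda cert: cert.not_after)
```

Run on a schedule, a check like this keeps working whether or not anyone is in the office, which is the whole point.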
Finally, there’s the Ericsson story I covered about a year and a half ago. A single expired certificate led to lengthy downtime affecting several different wireless providers and a whopping 32 million customers! So, Ericsson made changes to their Centralized User Database of subscribers. All of your backend applications will grow and evolve over time. Patches will be applied, software will be updated, and configurations will change. Your PKI must be ready for those inevitabilities.
Who do I blame here? It wouldn’t be fair to blame the people who were working directly on changing the database. Once again, the blame belongs to whoever decided not to automate certificate management properly. Just think of the headaches that could have been avoided with proper planning!
No one enjoys having fingers pointed at them when something goes wrong. So, here’s the moral of the story. If you want to keep your customers happy with maximum uptime and avoid the blame game, there’s one simple solution. Automate your certificate management properly and thoroughly!