As reported by Bleeping Computer and other sources, Google Voice experienced an outage caused by expired TLS certificates that affected a majority of users for several hours in mid-February 2021. Google Voice is an application that works on smartphones and computers and provides a phone number for calling, text messaging and voicemail using voice over IP (VoIP).
Google describes in their Issue Summary that they encrypt Google Voice traffic using Transport Layer Security (TLS) and rotate the TLS certificates and certificate configurations regularly. This outage was triggered when the active certificate in Google Voice frontend systems expired.
The outage lasted slightly more than four hours. In the Issue Summary, Google shares that engineers were immediately alerted to the outage, after which the team spent about two hours investigating and identifying the root cause of the problem and another two hours rolling out updated certificates and configuration information to affected systems. The Issue Summary doesn't specify what the engineers were doing during the investigation. Given our experience with other organizations, we expect they were looking in a number of areas. For example, was Google Voice down because of a problem with the system or service it runs on? Did an application crash? Was there a network problem, or an issue with a related system or database?
At some point, the team homed in on the problem being an expired certificate. Again, we don't know the situation with Google Voice, but there are causes we see again and again when certificates expire. The most common is probably the most obvious: organizations simply lose track of certificates because they're trying to track them manually, the processes aren't clear, people move positions, and so on. We also frequently see people update a certificate that is used on multiple systems but forget to update one of those systems, or update a certificate on a load balancer but forget that the same certificate is installed on the application behind the load balancer. In this case, the cause might also have been a third-party certificate expiring, given that Google Voice appears to work with a number of infrastructure partners and might be relying on those partners to keep their certificates up to date. Needless to say, there are many opportunities for a certificate to expire without being noticed!
In this situation, the outage was noticed when Google engineers received an alert. While an immediate alert is better than having an outage go unnoticed for some period of time, ideally there would have been an Outage Safety Net in place to warn them of the impending outage before it happened. The Venafi Prescriptive Guide to Preventing Certificate-Based Outages describes this as an important first step to put in place. Notifying a team that can quickly track down and remediate issues, as the Google team did, often resolves issues much faster than trying to track down the individual owners of certificates, an approach that is often ineffective and time-consuming.
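To make the idea of an early warning concrete, here is a minimal sketch in Python of a check that flags certificates before they expire. It is an illustration only, not Venafi's or Google's tooling: the 30-day warning window, the function names and the inventory format (a mapping of names to expiry times) are all assumptions made for this example.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone

# Hypothetical warning window; choose one that matches your renewal lead time.
WARN_DAYS = 30


def fetch_not_after(host, port=443, timeout=5.0):
    """Read the notAfter time of the certificate a live endpoint presents.

    Uses default verification, so the handshake itself fails if the
    certificate has already expired; this helper is for checks run in advance.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def days_until_expiry(not_after, now):
    """Whole days left before expiry (negative once the certificate has expired)."""
    return (not_after - now) // timedelta(days=1)


def classify(not_after, now, warn_days=WARN_DAYS):
    """Return 'expired', 'expiring-soon' or 'ok' for one certificate."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= timedelta(days=warn_days):
        return "expiring-soon"
    return "ok"


def report(inventory, now, warn_days=WARN_DAYS):
    """List (name, status, days_left) for certificates needing attention, most urgent first."""
    flagged = []
    for name, not_after in inventory.items():
        status = classify(not_after, now, warn_days)
        if status != "ok":
            flagged.append((name, status, days_until_expiry(not_after, now)))
    return sorted(flagged, key=lambda row: row[2])
```

In practice a check like this would run on a schedule and send its report to a team that can act on it. The inventory could be filled by calling `fetch_not_after` for each known endpoint, although that only covers certificates you already know about.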
While Google doesn’t describe why the certificate in their Google Voice frontend systems expired, at Venafi we know from working with many organizations that have suffered similar outages that there are some common challenges.
At a high level, most organizations today are trying to manage more machine identities, like TLS certificates, than ever before. In a study by Coleman Parkes that surveyed 550 CIOs from five countries (the United States, the United Kingdom, France, Germany and Australia), 97% of CIOs estimated that the number of TLS certificates used by their organization will increase 10–20% over the next year. In addition to growing volumes, certificates are needed faster and more frequently, especially in cloud and DevOps environments, while shorter certificate lifespans mean more frequent renewals.
When you get to the next level of detail, the challenges go deeper. Multiple applications and devices often share certificates throughout the delivery process, which makes outage prevention difficult without reliable and repeatable processes in place. And simply getting certificates from the providers to the applications and systems that consume them takes time and effort, and is often error-prone if the people involved do not have in-depth knowledge of how certificates work.
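One way to picture the shared-certificate problem is to compare what each system is actually serving. The Python sketch below groups deployment locations by certificate fingerprint and lists any location still holding a different certificate from the current one. The function names, the location labels and the input format (a mapping of locations to DER-encoded certificate bytes) are assumptions made for this illustration, not part of any specific product.

```python
import hashlib
from collections import defaultdict


def fingerprint(cert_der):
    """SHA-256 fingerprint of a DER-encoded certificate, as a hex string."""
    return hashlib.sha256(cert_der).hexdigest()


def group_by_certificate(deployments):
    """Map each fingerprint to the sorted list of locations serving that certificate.

    `deployments` is a dict of location name -> DER bytes. In practice the bytes
    could come from ssl.SSLSocket.getpeercert(binary_form=True) on each endpoint.
    """
    groups = defaultdict(list)
    for location, cert_der in deployments.items():
        groups[fingerprint(cert_der)].append(location)
    return {fp: sorted(locations) for fp, locations in groups.items()}


def stale_copies(deployments, current_der):
    """Locations still serving something other than the current certificate."""
    current = fingerprint(current_der)
    return sorted(
        location
        for location, cert_der in deployments.items()
        if fingerprint(cert_der) != current
    )
```

For example, if a renewed certificate has been installed on a load balancer and one application server but not on a second server, `stale_copies` returns only that second server, which is the missed copy that would otherwise surface as an outage.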
According to Kevin Bocek, Venafi Vice President of Security Strategy and Threat Intelligence, this is something that happens every day to Global 5000 businesses. “Certificates can take weeks to renew, and mistakes are often made. These mistakes can cause a service or application to go down for hours, days, and, in some cases, even longer. This is not a unique occurrence and can impact the world’s largest technology organizations. For example, Microsoft Azure and LinkedIn have experienced outages due to expired certificates in the past.”
He continued, “The problem is that most businesses and government agencies are using thousands of certificates, but they don’t have the insight or automation needed to replace certificates before they expire. An outage based on a failed certificate is really painful, not just for consumers but also for the IT and security teams trying to fix them. Finding an expired certificate manually is like looking for a very specific needle in a stack of needles.”
With these challenges and more, it’s not surprising that certificate-based outages are common. Certificate-based outages, though, aren’t inevitable. Venafi has helped global corporations successfully prevent site and service outages due to certificate expirations and misconfigurations. We have taken what we've learned about how people, processes and technology need to work together and documented it in the Venafi Prescriptive Guide to Preventing Certificate-Based Outages mentioned earlier in this article. The eight steps outlined in this guide can help any organization eliminate certificate-based outages for good.