On March 15, Microsoft experienced a widespread outage in Azure Active Directory. According to Microsoft, the outage was caused by a “rotation of keys.” The result was a 14-hour outage that took down Office 365, Dynamics 365, Xbox Live, Teams, and additional third-party apps that depend on Azure for authentication.
We’ve said it before, and we’ll say it again: no business, big or small, is immune to outages if it is not managing the full lifecycle of its machine identities. When automation and full certificate visibility aren’t the backbone of your machine identity management strategy, error is inevitable. Let’s take a closer look at what exactly went wrong at Microsoft, and how you can avoid a similar occurrence in your organization using Venafi’s comprehensive platform for machine identity management.
What exactly went wrong with Microsoft Azure?
Officials from Microsoft have confirmed that on March 15th “an error occurred in the rotation of keys used to support Azure AD's use of OpenID, and other, Identity standard protocols for cryptographic signing operations.”
As part of Microsoft’s standard security practices, an automated system eliminates redundant keys. According to Microsoft, for the last few weeks “a particular key was marked as 'retain' for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that 'retain' state, leading it to remove that particular key.”
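To make the failure mode concrete, here is a minimal Python sketch of a cleanup job that honors a 'retain' marker when deciding which keys to retire. It is not Microsoft's implementation: the `SigningKey` class, the `retain` field, and the `keys_to_remove` function are hypothetical names chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class SigningKey:
    """Hypothetical record for a signing key tracked by a rotation job."""
    key_id: str
    created: datetime
    retain: bool = False  # assumed flag: "do not remove, even if old"


def keys_to_remove(keys: List[SigningKey], now: datetime,
                   max_age: timedelta) -> List[SigningKey]:
    """Return keys older than max_age, skipping any key marked 'retain'."""
    return [
        key for key in keys
        if not key.retain and now - key.created > max_age
    ]


# Example: two keys are past the age limit, but one is held for a migration.
now = datetime(2024, 1, 31)
keys = [
    SigningKey("old-key", datetime(2023, 10, 1)),
    SigningKey("migration-key", datetime(2023, 10, 1), retain=True),
    SigningKey("current-key", datetime(2024, 1, 20)),
]
removable = keys_to_remove(keys, now, max_age=timedelta(days=60))
```

The bug Microsoft describes corresponds to dropping the `not key.retain` check: the migration key would then look like any other redundant key and be removed.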
Once that key was removed, any app using Azure AD authentication immediately started rejecting tokens that were signed with the removed key. The result? All Microsoft users who attempted to log in to affected apps and third-party services were rejected.
While Microsoft did swiftly take action to mitigate the impact, the outage couldn’t be immediately reversed due to “different server implementations that handle caching differently”. It wasn’t until the affected apps had picked up the updated key metadata and refreshed their caches that users could regain access to their accounts.
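Why would caching slow the recovery? The sketch below, again in Python and again an illustration rather than any vendor's actual code, shows one common pattern: an app keeps a local copy of the published signing keys and only re-fetches the metadata when it sees an unknown key ID, and no more often than a minimum interval. The `SigningKeyCache` class, the `fetch_keys` callback, and the 60-second interval are all assumptions.

```python
import time
from typing import Callable, Dict, Optional


class SigningKeyCache:
    """Hypothetical app-side cache of published signing keys, by key ID."""

    def __init__(self, fetch_keys: Callable[[], Dict[str, str]],
                 min_refresh_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_keys = fetch_keys        # assumed: returns {key_id: key}
        self._min_refresh = min_refresh_seconds
        self._clock = clock
        self._keys: Dict[str, str] = {}
        self._last_refresh: Optional[float] = None

    def _refresh(self) -> None:
        # Copy the published keys so later changes need another refresh.
        self._keys = dict(self._fetch_keys())
        self._last_refresh = self._clock()

    def get(self, key_id: str) -> Optional[str]:
        """Return the key for key_id, or None if the token must be rejected."""
        if self._last_refresh is None:
            self._refresh()
        elif (key_id not in self._keys
              and self._clock() - self._last_refresh >= self._min_refresh):
            # Unknown key ID: re-fetch metadata, but not more often than
            # the minimum interval allows.
            self._refresh()
        return self._keys.get(key_id)
```

An app built this way picks up a restored key within about a minute of the corrected metadata being published. An app that refreshes only on a long fixed timer keeps rejecting tokens until that timer expires, which is consistent with Microsoft's note that different caching behavior led to different recovery times.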
On the outage, Microsoft released a statement expressing that they “understand how incredibly impactful and unacceptable this is and apologize deeply. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future.”
How could Venafi have helped prevent this outage?
According to Michael Thelander, Venafi Director of Product Marketing, “Poorly orchestrated key rotation is the Achilles heel of modern digital transformation efforts; this oversight is capable of bringing down entire applications and services in an instant.”
This is no isolated incident or freak occurrence. Outages, audit failures, lack of visibility… these are all the result of failing to secure and manage machine identities across your entire organization.
“Unfortunately, these kinds of outages will only continue until organizations adopt an enterprise-wide approach to managing the machine identities these keys and certificates represent,” Thelander comments.
“Digital transformation is not going to slow down, and this requires automation of keys and certificates found in workloads, containers, and across cloud environments as well as those in on-premises environments.”
To kickstart your digital transformation, learn more about how a single platform for enterprise-wide machine identity management can help you eliminate certificate-related outages for good!