This Week: Azure AD Outage – What Happened?
At least it was until March 15, when Azure had a significant outage that affected not only Azure services but also Office, Teams, Dynamics 365, Xbox Live and more. So what happened this week on Azure This Week? We explore just that, so stay tuned for the analysis.
We’ve come to think of cloud computing as the magic sauce that holds together our mobile apps, our email access and the authentication of users for company infrastructure. The cloud never sleeps. It never stops computing. It never stops charging. Yet on March 15th, Azure had an outage that saw widespread consequences for millions of users. Fourteen hours of various outages on the second-largest cloud computing platform was not part of the plan. Well, obviously.
So what happened? At 9 pm UTC, or 9 pm UK time, same thing, a routine rotation of security keys for Azure Active Directory was performed. In itself, this is a good thing, as the rotation keeps users and tenants safe and fresh. On this day, though, Microsoft was performing a rather complex migration of data between cloud providers, so they had marked one specific key as “don’t rotate”. They needed this key to remain for the time being to finish the migration. Now, the automatic rotation process ignored this. As the new keys reached Azure services, users were not able to log in, as the system was using the old key, the one that was marked “don’t rotate”. Imagine the company swipe cards for all the staff were changed overnight, but nobody told anyone, and when everybody showed up to the office the next day, no one could get in.
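To make the failure mode a bit more concrete, here is a minimal Python sketch of a key cleanup step that either honors or ignores a "don't rotate" flag. This is purely illustrative and not Microsoft's actual code: the `SigningKey` class, the `retain` field and the `keys_to_remove` function are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class SigningKey:
    key_id: str
    in_use: bool
    retain: bool = False  # hypothetical "don't rotate" flag


def keys_to_remove(keys, honor_retain=True):
    """Return the keys an automated cleanup would remove.

    With honor_retain=False the flag is ignored, which mimics the
    kind of bug described in the episode.
    """
    return [
        k for k in keys
        if not k.in_use and not (honor_retain and k.retain)
    ]


keys = [
    SigningKey("old-1", in_use=False),
    SigningKey("migration", in_use=False, retain=True),
    SigningKey("current", in_use=True),
]

# Correct behavior: the retained migration key survives the cleanup.
safe = [k.key_id for k in keys_to_remove(keys)]

# Buggy behavior: the flag is ignored and the migration key is removed too.
buggy = [k.key_id for k in keys_to_remove(keys, honor_retain=False)]

print(safe)
print(buggy)
```

The only difference between the two runs is whether the flag is consulted, which is roughly how small a bug of this kind can be.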
If you think back to 28 September 2020, Azure had a similar outage that, funnily enough, also affected Azure Active Directory. So why wasn’t it fixed then, you might ask? Ironically, this new outage was in part due to Azure fixing the root cause of the September outage. Now, the work being done hadn’t completed yet, as these major changes do take time, but once finished it would also have prevented this recent outage. So why did it happen? It was something as simple as a software bug that failed to acknowledge a flag saying “please don’t rotate this key”. And even though a fix was implemented within a couple of hours, it took a lot longer for services to clear their caches and for the old key to be propagated to all corners of Azure.
How can we then be sure that this won’t happen again? Of course, software will always have bugs, or as one of my mentors likes saying, all software always has an infinite number of unknown bugs. However, Microsoft was quick to state: “We understand how incredibly impactful and unacceptable this incident is and apologize deeply. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future.” And they are really taking big steps to prevent any future outages. As Microsoft explains it, Azure AD is undergoing a multi-phase effort to apply additional protections to the backend safe deployment process, so that’s the internal deployment process, in order to ensure that this doesn’t happen again. Yay.
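The point about caches is worth a quick illustration: even after a fix lands upstream, every service that cached the key set keeps its own copy until that copy expires. Below is a rough Python sketch of such a cache, under the assumption that keys are fetched through some callable and cached for a fixed time. `KeyCache`, `fetch_keys` and the injected `clock` are hypothetical names for this example and do not describe how Azure AD works internally.

```python
import time


class KeyCache:
    """Caches a key set and refreshes it on expiry or on an unknown key ID."""

    def __init__(self, fetch_keys, ttl_seconds=3600, clock=time.monotonic):
        self._fetch_keys = fetch_keys  # hypothetical callable returning {key_id: key}
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys = {}
        self._fetched_at = None

    def _refresh(self):
        # Take a snapshot so later upstream changes are not visible until the next refresh.
        self._keys = dict(self._fetch_keys())
        self._fetched_at = self._clock()

    def get(self, key_id):
        expired = (
            self._fetched_at is None
            or self._clock() - self._fetched_at > self._ttl
        )
        if expired or key_id not in self._keys:
            self._refresh()
        return self._keys.get(key_id)


store = {"k1": "key-one"}   # stands in for the upstream key service
now = [0.0]                 # fake clock so the example is deterministic
cache = KeyCache(lambda: store, ttl_seconds=60, clock=lambda: now[0])

first = cache.get("k1")     # first call fetches and caches the key set

store.pop("k1")             # upstream changes: k1 is removed, k2 is added
store["k2"] = "key-two"

stale = cache.get("k1")     # still served from the cache until the TTL runs out
now[0] = 61.0
after_ttl = cache.get("k1") # TTL expired, refresh happens, k1 is now gone
fresh = cache.get("k2")     # the new key is visible after the refresh
```

Multiply that stale window by every service and region holding its own copy, and a fix that takes a couple of hours to ship can still take much longer to be felt everywhere. In a real system, refreshing on every unknown key ID would probably need rate limiting as well.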
Now, these kinds of outages really shouldn’t happen, though. Platforms as mature as Azure should have a million bug-catching facilities and remediation functions. But here we are. We recently saw both AWS and GCP have outages too, so it is something we have to kind of expect to some degree. While cloud computing continues to provide way more benefits than drawbacks, even with global outages like this one, we as cloud developers should be able to build our applications a little more robustly for when these things occur. Something to keep in mind. If you want to read the outage notification from Microsoft yourself, the link is in the description.
So what do you think about the Azure outage and how Microsoft handled it? Let me know in the comments, give us a thumbs up if you liked this episode, and subscribe to the ACG channel for much more cloud content. As we say on the A Cloud Guru team when we get a new team member and have to show them how this show’s sausage is made: seek and use your cloud. So see you next week, and keep giving us thumbs up.