Microsoft yesterday experienced an outage in online services including Teams, M365 and Outlook

According to Bloomberg News, Microsoft yesterday experienced an outage in online services including Teams, M365 and Outlook.

This comes on the heels of Microsoft’s positive earnings report on Tuesday, but contrasts with the company’s announcement of a 5% headcount cut, which would lay off 10,000 workers. The layoffs included members of Azure, Microsoft’s cloud services offering and the company’s revenue growth engine. Growth across the cloud services industry has slowed, however, signaling that the industry is maturing.

Azure was at the center of the outage, and Microsoft posted an impact summary to the Azure status history site, continuing its practice of investigating and disclosing the root cause of outages. The outage lasted 3 hours across multiple regions and affected Azure resources in Public Azure regions. Popular services M365 and PowerBI were also affected.

According to Microsoft’s own disclosure, a Wide Area Network (WAN) issue caused the outage. A change the company made to its WAN broke connectivity between the Internet and Microsoft’s core suite of services.

The US Federal Aviation Administration (FAA) also experienced an outage last week in an important pilot safety notification system known as NOTAM. Both outages stemmed from system changes. According to the FAA, its outage was caused by a corrupted file in the primary and secondary databases. When a contractor deleted that file, systems slowed down, pilots were unable to use NOTAM alerts, and domestic flights across the United States were grounded.

Downtime remains a significant risk for the FAA, which still relies on outdated systems even as its reliance on cloud service providers increases.

Although the causes of the two disruptions differ, widespread impact is common to both, as it is to any disruption at a major organization.

Regardless of the source, the financial impact of a system outage cannot be overemphasized. The Uptime Institute found that outages costing $100,000 or more increased to more than 60% of all failures (up from 39% in 2019). More businesses are also paying over $1 million to recover from an outage: the share of outages with seven-figure costs rose to 15%, up from 11% in previous years.

According to the report, Azure is the second-largest Cloud Service Provider (CSP), behind Amazon, the creator and market leader of the CSP segment.

Microsoft promises to provide an initial root cause analysis, or post-incident review (PIR), within the next 3 days and a final PIR 14 days thereafter.

We spoke with Chip Gibbons, CISO of managed services company Thrive, to learn about post-outage mitigation plans. Highlights include:

Planning is essential for a company of any size. Many businesses can benefit from a comprehensive data backup and recovery plan with relative ease. Larger organizations may have to deal with more details, especially around system recovery methods, applications and working conditions. However, certain aspects of data recovery always need to be addressed, such as how the backup system works, who is responsible for it, the recovery point objective (RPO), and how much data needs to be backed up.

This can significantly reduce the time it takes to resume operations after a disaster to meet your specified Recovery Time Objective (RTO).
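To make the two objectives concrete, here is a minimal Python sketch of how a team might check one incident against its RPO and RTO. It is an illustration only: the `RecoveryObjectives` class, the `check_objectives` function and the timestamps are hypothetical and do not come from Thrive or Microsoft.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    """Hypothetical container for a company's recovery targets."""
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


def check_objectives(objectives, last_backup, outage_start, service_restored):
    """Compare one incident against the RPO and RTO targets."""
    # Data written after the last good backup is lost when the outage begins.
    data_loss_window = outage_start - last_backup
    # Downtime runs from the start of the outage until service is back.
    downtime = service_restored - outage_start
    return {
        "rpo_met": data_loss_window <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
        "data_loss_window": data_loss_window,
        "downtime": downtime,
    }


# Illustrative values only: a 1-hour RPO, a 4-hour RTO and a 3-hour outage.
targets = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
outage_start = datetime(2023, 1, 25, 7, 0)
result = check_objectives(
    targets,
    last_backup=outage_start - timedelta(minutes=30),
    outage_start=outage_start,
    service_restored=outage_start + timedelta(hours=3),
)
print(result["rpo_met"], result["rto_met"])
```

With these assumed numbers both objectives are met. If the last backup had been taken two hours before the outage, the RPO check would fail even though service was restored in time.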

Routine testing of your DR strategy – Testing is essential, but it can disrupt business operations and potentially reduce productivity. Each test gives the IT team a chance to discover what is wrong with the DR strategy and to adjust it over time. When these issues are adequately addressed during the testing phase, organizations have a better chance of success when they actually need to rely on their DR strategy.

Remember, IT infrastructure is managed by people. Therefore, DR strategies must consider human behavior. For example, if a company’s location is damaged by a disaster, the organization needs to ensure that employees still have access to the data they need to do their jobs effectively.