Microsoft’s hosted Exchange service wobbled badly this week, with an extended outage on Tuesday and more delays on Thursday. Messages were stuck in outboxes and there were long delays for mail in transit.
The head of the engineering department responsible for BPOS (Business Productivity Online Suite) wrote a long apology and explanation this evening with details about the outages. He strikes an appropriately humble tone and promises to do better.
“[O]ver the last few days, we have not satisfied our customer’s needs. On Tuesday and today we experienced three separate service issues that impacted customers served from our Americas data center. All of these issues have been resolved and the service is now running smoothly. . . .
I’d like to apologize to you, our customers and partners, for the obvious inconveniences these issues caused. We know that email is a critical part of your business communication, and my team and I fully recognize our responsibility as your partner and service provider.”
Microsoft’s hosted services have more than 40 million customers worldwide, including Fortune 500 companies and government agencies. When the mail isn’t delivered instantly, you know there are a lot of Very Important People spitting fire and beating up on Microsoft’s support desk. I would guess that there will be a lot of meetings in the next few days trying to figure out how in the world to keep things like this from happening again.
An interesting statistic in the apology provides a small glimpse into the scale of the service. Between 11:35am and 12:04pm PDT, more than 1.5 million messages queued up on the service awaiting delivery. It’s not clear whether that was the total volume of mail or just the messages that were delayed, but either way, that’s a lot of messages in 29 minutes.
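To put that number in perspective, here is a quick back-of-the-envelope calculation in Python. It is only a rough sketch: it treats 1.5 million as a floor (Microsoft said "more than"), and it assumes the messages arrived evenly across the window, which they almost certainly did not. The variable names are mine, not Microsoft's.

```python
from datetime import datetime

# Lower bound from the apology: "more than 1.5 million messages".
QUEUED_MESSAGES = 1_500_000

# The window quoted in the apology (PDT). Only the times matter here,
# so the date that strptime fills in by default is irrelevant.
window_start = datetime.strptime("11:35", "%H:%M")
window_end = datetime.strptime("12:04", "%H:%M")

window_minutes = (window_end - window_start).total_seconds() / 60

# Assumption: messages queued at a steady rate across the whole window.
per_minute = QUEUED_MESSAGES / window_minutes
per_second = per_minute / 60

print(f"Window length: {window_minutes:.0f} minutes")
print(f"At least {per_minute:,.0f} messages per minute")
print(f"At least {per_second:,.0f} messages per second")
```

Under those assumptions, that works out to roughly 51,700 messages a minute, or a bit more than 860 a second, from a single data center.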
If you’re relying on Microsoft Online Services, you should bookmark the Service Health Dashboard at https://health.noam.microsoftonline.com (under Computers / MS Online Services on the Bruceb Favorites page). Microsoft promises to use the dashboard to do a better job of providing useful information about system issues.
“As a result of Tuesday’s incident, we feel we could have communicated earlier and been more specific. Effective today, we updated our communications procedures to be more extensive and timely. We understand that it is critical for our customers to be as fully informed as possible during service impacting events. We will continue to improve the timeliness and specificity of our communications. The primary mechanism for communicating to our customers on issues has been and will continue to be the Service Health Dashboard.”
Let’s hope things settle down!