RDN Directions
A Genuine Failure
The servers may have been running, but legal copies of Windows couldn't be activated -- by any normal definition of the word, that's an outage.
Anyone who's spent time running a server -- let alone a data center full of servers -- knows that problems happen and systems go down. But when you're Microsoft and you're trying to convince millions of developers to use your Live services as a platform for future applications, the stakes are higher. Much higher. And the Microsoft Windows Genuine Advantage (WGA) fiasco shows that the company still has some work to do.
The whole problem started when someone at Microsoft deployed pre-production code to production WGA servers. (WGA is the piece of software that validates that a copy of Windows is genuine.) Errors in the code caused the servers to refuse to validate genuine systems.
Depending on the operating system they were using, users were presented with a pop-up message telling them that their copy of Windows was not genuine, leaving them either unable to access many downloads from the Microsoft Web site or, in the case of Windows Vista, cut off from certain OS features such as the Aero user interface. Even after uninstalling the pre-production code, Microsoft's technicians failed to notice that the system was still refusing to validate. Eventually, the problem was resolved but only after 12,000 users were incorrectly denied validation and effectively branded as pirates.
Failure of Process
The entire chain of events kicked off when pre-production code was installed on production servers. The lesson for IT developers and managers: evaluate your own procedures for moving code from testing servers to production servers. In your organization, who has the authority to install code on production servers, and what systems do you have in place to ensure that procedures are followed? What plans do you have in place to quickly move to backup systems?
If you're a developer considering Windows Live ID or another Windows Live service as part of your Web site or application, this incident with such an important service raises real concerns. Microsoft, like many other Web services providers, offers little or no service-level guarantee, and what's good enough for general Web use may not be good enough to build your application on.
Unfortunately, you don't get a chance to read a vendor's management policies or compare whether Microsoft or Yahoo! or Google has the best procedures in place to prevent problems. So events like this, or the recent Skype problems, are the only data points you have to work with. And this one didn't work in Microsoft's favor.
Failure of Design
More than simple human error was involved here. The problems experienced by the WGA servers were compounded by a poor design decision on Microsoft's part: WGA assumes that a copy of Windows is not genuine unless proven otherwise -- the benefit of the doubt goes to Microsoft, not the user.
That proved to be a critical design-time error, because it meant that the WGA service would not fail gracefully. Once WGA brands a system as non-genuine, Windows begins to turn off features. Initially, the user just gets a pop-up message and some Vista features (like Aero) turn off. Eventually, if the system can't be validated, it can degrade into a mode where no applications can be run until the OS is validated.
Most developers don't work on systems that try to deliberately disable features. But if you're building an application using the Live services, it's worth considering how your application should behave if those services aren't available.
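One way to think about that question is to treat "the service said no" and "the service could not answer" as different outcomes. The sketch below is illustrative only: the names (Verdict, ServiceUnavailable, check_license, enabled_features) and the feature labels are hypothetical, and it does not describe how WGA or any Live service actually works. It assumes your application keeps a record of whether the last successful check passed.

```python
from enum import Enum
from typing import Callable, Set


class Verdict(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNKNOWN = "unknown"  # the service could not give an answer


class ServiceUnavailable(Exception):
    """Raised by the (hypothetical) remote call when the service is down."""


def check_license(remote_check: Callable[[str], bool], key: str) -> Verdict:
    """Ask the remote service; map any failure to UNKNOWN, never to INVALID."""
    try:
        return Verdict.VALID if remote_check(key) else Verdict.INVALID
    except (ServiceUnavailable, TimeoutError, ConnectionError):
        return Verdict.UNKNOWN


def enabled_features(verdict: Verdict, last_known_good: bool) -> Set[str]:
    """Decide which (hypothetical) features stay on for a given verdict."""
    core = {"core"}
    premium = {"premium_ui", "fast_cache"}
    if verdict is Verdict.VALID:
        return core | premium
    if verdict is Verdict.UNKNOWN and last_known_good:
        # Benefit of the doubt goes to the user while the service is unreachable.
        return core | premium
    return core
```

Note the limit of this approach: it only helps when the service fails to respond. A service that is up but returns wrong answers, as in the WGA incident, would still produce INVALID here, so a real design would probably also want some tolerance (for example, requiring repeated negative results) before turning features off.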
Failure of Communication
Once the WGA system was restored, Microsoft's PR department engaged in spin control. It refused to call the problem an "outage" and went to great pains to say that no copies of Windows went into "reduced functionality mode." But those are effectively word games.
The servers may have been running, but legal copies of Windows couldn't be activated. By any normal definition of the word, that's an outage. And although no copies of Windows ever entered the most locked-down state possible, Vista users definitely got punished. The Aero user interface was turned off, ReadyBoost (a performance-improving feature) was disabled, and Windows Defender only scanned for the most serious problems. Again, by the normal use of the phrase, that's reduced functionality.
Is your organization prepared to handle a problem, particularly if you work on a public-facing application or site? Can you speak clearly and honestly to customers without resorting to legalisms?
As an IT developer, you should insist on straightforward communication from Microsoft or any other service provider. If you don't get it, vote with your wallet.
WGA shows that building a platform requires a commitment to service: to prevent problems from occurring, to minimize the impact of them when they do occur, and to communicate to partners. Unfortunately, Microsoft has work to do on all three.
About the Author
Greg DeMichillie analyzes and writes about Microsoft's development platform and tools for Directions on Microsoft, a research firm dedicated to tracking Microsoft. He was previously the group program manager at Microsoft responsible for the overall design and feature set for Visual C# and C++. A founding member of the C# language team, DeMichillie was a key contributor to the initial design and development of .NET.