Give Applications a Voice
Speech synthesis and recognition in .NET
- By Michael Dunn
Speech synthesis and recognition have been around for a while, but have yet to become "mainstream" in most applications. If you're planning to use speech recognition in your application, you need to choose the right tool for the job. With Microsoft releasing new products in this field, it's easy to get confused about which product to use and when.
Office Communication Server 2007
Office Communication Server has been portrayed simply as the new version of Live Communication Server. It's actually a combination of Live Communication Server for instant messaging, and Microsoft Speech Server for telephony.
The merger of these two products fits into Microsoft's Unified Communications plan. Unified Communications is the idea that your e-mail, telephone, instant messaging, faxes and voice mail can all be accessed from a single location. For example, when running Exchange Server 12, also currently in beta, alongside Office Communication Server, users can listen and respond to e-mails and calendar appointments over the phone.
Office Communication Server's Speech Platform Services also allows you to develop customer-facing telephony applications, such as an Interactive Voice Response (IVR) application used to make flight reservations. While Microsoft Speech Server 2004 only supported Speech Application Language Tags (SALT), a Microsoft-founded standard, Office Communication Server will support not only SALT but also its rival, VoiceXML. Microsoft has also created a new project type, Speech Workflow Applications, which lets you write an IVR application using a managed language and Windows Workflow Foundation.
The advantage Speech Workflow Applications have over SALT and VoiceXML is that they let you visualize the call flow, with no additional markup language to learn. Developers already familiar with Windows Workflow Foundation and a .NET language can rapidly build IVR applications using their existing knowledge.
Microsoft Speech Server 2004 was not a widely known product in the .NET development community. While a Microsoft customer might use .NET for their Web and Windows development, their IVR applications were still being developed on other platforms and in other programming languages. Office Communication Server 2007 will give .NET developers the ability to create sophisticated telephony and instant messaging applications using the familiar syntax of either C# or VB.NET. By itself, Microsoft Speech Server 2004 was a great product, but it didn't attract a large user base. Hopefully, repackaging it into Office Communication Server will breathe new life into the product.
Windows Vista, SAPI 5.3 and .NET 3.0
In Windows Vista, one of the first features you will notice is the new Microsoft voice, Microsoft Anna, who replaces the Windows 2000 and XP default voice, Microsoft Sam. With the retirement of Microsoft Sam, Anna brings a more pleasant and natural-sounding voice: unlike previous Microsoft voices, hers was created from actual voice recordings.
The .NET Framework 3.0 includes a managed speech API, System.Speech, which lets you rapidly create speech-enabled Windows applications for Windows Vista using Visual Studio 2005. As with all versions of SAPI, this one is tied to the operating system: SAPI 5.3 is only available on Windows Vista. Your application can still run against earlier versions, such as Windows XP's SAPI 5.1, but if it uses any features specific to SAPI 5.3, expect a not-supported error.
The two main namespaces to become familiar with for .NET speech-enabled applications are System.Speech.Synthesis and System.Speech.Recognition.
The System.Speech.Synthesis namespace can be used to access the SAPI synthesizer engine to render text into speech using an installed voice, such as Microsoft Anna.
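As a minimal sketch of the synthesis API (assuming a voice such as Microsoft Anna is installed, as it is by default on Windows Vista), speaking a phrase takes only a few lines:

```csharp
using System.Speech.Synthesis;

class SynthesisDemo
{
    static void Main()
    {
        // Dispose the synthesizer when done to release the audio device.
        using (SpeechSynthesizer synth = new SpeechSynthesizer())
        {
            synth.SetOutputToDefaultAudioDevice();

            // SelectVoice throws if the named voice is not installed, so a
            // production application should check GetInstalledVoices() first.
            synth.SelectVoice("Microsoft Anna");

            // Speak blocks until the phrase finishes; SpeakAsync is the
            // non-blocking alternative.
            synth.Speak("Welcome to speech synthesis in .NET.");
        }
    }
}
```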
The SAPI 5.3 synthesizer now supports the W3C standard Speech Synthesis Markup Language (SSML), a markup language that lets you fine-tune how the synthesizer produces a phrase, controlling properties such as pronunciation, speed, volume and pitch.
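A sketch of how SSML plugs into the managed API: the SpeakSsml method accepts an SSML document directly, and the fragment below (illustrative, not from any product sample) uses the prosody element to slow down and soften one phrase.

```csharp
using System.Speech.Synthesis;

class SsmlDemo
{
    static void Main()
    {
        // An illustrative SSML fragment: the <prosody> element adjusts
        // the rate and volume of the time-of-day phrase.
        string ssml =
            "<speak version='1.0' " +
            "xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>" +
            "Your flight departs at " +
            "<prosody rate='slow' volume='soft'>nine thirty A M</prosody>." +
            "</speak>";

        using (SpeechSynthesizer synth = new SpeechSynthesizer())
        {
            synth.SpeakSsml(ssml);
        }
    }
}
```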
The System.Speech.Recognition namespace is used to recognize a user's voice and convert it into text. The SAPI 5.3 recognition engine now supports the W3C standard Speech Recognition Grammar Specification (SRGS), a markup language that defines how and which words are recognized. SAPI 5.3 also adds support for Semantic Interpretation. For example, a grammar might define yes or no as acceptable answers. Using Semantic Interpretation, a user could say "no," "nope" or "not," and the semantic value for each of these phrases would be "no." This makes life much easier for developers, who only have to check the semantic result for "no."
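The yes/no example above can be sketched in code: a minimal grammar (built programmatically rather than in SRGS markup, under the assumption that a SAPI recognizer and microphone are available) maps several spoken phrases to the single semantic value "no".

```csharp
using System;
using System.Speech.Recognition;

class RecognitionDemo
{
    static void Main()
    {
        // Map several spoken phrases to the single semantic value "no".
        Choices negatives = new Choices();
        foreach (string phrase in new string[] { "no", "nope", "not" })
        {
            negatives.Add(new GrammarBuilder(
                new SemanticResultValue(phrase, "no")));
        }

        // Tag the alternatives so the result exposes them under "answer".
        GrammarBuilder builder = new GrammarBuilder();
        builder.Append(new SemanticResultKey("answer",
            new GrammarBuilder(negatives)));

        using (SpeechRecognitionEngine recognizer =
            new SpeechRecognitionEngine())
        {
            recognizer.LoadGrammar(new Grammar(builder));
            recognizer.SetInputToDefaultAudioDevice();

            // Blocks until a phrase is recognized or times out.
            RecognitionResult result = recognizer.Recognize();
            if (result != null)
            {
                // Whichever phrase the caller said, the semantic
                // value here is "no".
                Console.WriteLine(result.Semantics["answer"].Value);
            }
        }
    }
}
```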
Microsoft is releasing an array of new products in 2007 that will directly and indirectly support speech synthesis and recognition development in .NET. If you're planning to develop a Windows application with speech synthesis and recognition capabilities, you should use the .NET 3.0 System.Speech namespace. If your application is going to run over a telephone, you should use the Speech Platform Services in Office Communication Server 2007, formerly known as Microsoft Speech Server.
Michael Dunn is a consultant with Magenic Technologies, a Microsoft-centric consulting firm. He can be reached at firstname.lastname@example.org.