Grabbing the output of the Microsoft Speech API text-to-speech engine as audio data


A while ago I wrote a post on Implementing a "say" command using ISpVoice from the Microsoft Speech API. It showed how to use the Speech API to do text-to-speech, but was limited to playing the generated audio on the default audio device.

Recently on the Windows Pro Audio forums, user falven asked a question about how to grab the output of the text-to-speech engine as a stream for further processing.

Here's how to do it.

The key part is to use ISpStream::BindToFile to save the audio data to a .wav file, or ISpStream::SetBaseStream to save it to a given IStream. Then pass the ISpStream to ISpVoice::SetOutput before calling ISpVoice::Speak.

            ISpStream *pSpStream = nullptr;
            hr = CoCreateInstance(
                CLSID_SpStream, nullptr, CLSCTX_ALL,
                __uuidof(ISpStream),
                (void**)&pSpStream
            );
            if (FAILED(hr)) {
                ERR(L"CoCreateInstance(ISpStream) failed: hr = 0x%08x", hr);
                return -__LINE__;
            }
            ReleaseOnExit rSpStream(pSpStream);
           
            if (File == where) {
                hr = pSpStream->BindToFile(
                    file,
                    SPFM_CREATE_ALWAYS,
                    &SPDFID_WaveFormatEx,
                    &fmt,
                    0
                );
                if (FAILED(hr)) {
                    ERR(L"ISpStream::BindToFile failed: hr = 0x%08x", hr);
                    return -__LINE__;
                }
            } else {
                // capture to an in-memory IStream
                pStream = SHCreateMemStream(nullptr, 0);
                if (nullptr == pStream) {
                    ERR(L"SHCreateMemStream failed");
                    return -__LINE__;
                }
               
                hr = pSpStream->SetBaseStream(
                    pStream,
                    SPDFID_WaveFormatEx,
                    &fmt
                );
                if (FAILED(hr)) {
                    ERR(L"ISpStream::SetBaseStream failed: hr = 0x%08x", hr);
                    return -__LINE__;
                }
            }
           
            hr = pSpVoice->SetOutput(pSpStream, TRUE);
            if (FAILED(hr)) {
                ERR(L"ISpVoice::SetOutput failed: hr = 0x%08x", hr);
                return -__LINE__;
            }
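
The snippet passes a `fmt` to both BindToFile and SetBaseStream, but does not show how it is filled in. Here is a minimal sketch of the arithmetic behind an integer PCM format. `WaveFormat` and `MakePcmFormat` are hypothetical names; the struct only mirrors the fields of WAVEFORMATEX so the example builds without the Windows headers, and the actual values used by say.exe are not stated in this post.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in that mirrors the fields of WAVEFORMATEX.
struct WaveFormat {
    std::uint16_t formatTag;       // wFormatTag; 1 is WAVE_FORMAT_PCM
    std::uint16_t channels;        // nChannels
    std::uint32_t samplesPerSec;   // nSamplesPerSec
    std::uint32_t avgBytesPerSec;  // nAvgBytesPerSec
    std::uint16_t blockAlign;      // nBlockAlign
    std::uint16_t bitsPerSample;   // wBitsPerSample
    std::uint16_t extraSize;       // cbSize; 0 for plain PCM
};

// Build an integer PCM format. The two derived fields follow from the others:
//   blockAlign     = channels * bitsPerSample / 8   (bytes per frame)
//   avgBytesPerSec = samplesPerSec * blockAlign
inline WaveFormat MakePcmFormat(std::uint16_t channels,
                                std::uint32_t samplesPerSec,
                                std::uint16_t bitsPerSample) {
    assert(channels > 0);
    assert(bitsPerSample % 8 == 0);  // whole bytes per sample
    const std::uint16_t blockAlign =
        static_cast<std::uint16_t>(channels * bitsPerSample / 8);
    return WaveFormat{1, channels, samplesPerSec,
                      samplesPerSec * blockAlign, blockAlign,
                      bitsPerSample, 0};
}
```

For example, `MakePcmFormat(2, 44100, 16)` describes 16-bit stereo at 44.1 kHz (an assumed rate), with 4 bytes per frame and 176400 bytes per second.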

Updated source and binaries attached.

Usage:

>say.exe
say "phrase" [--file <filename> | --stream]
runs phrase through text-to-speech engine
if --file is specified, writes to .wav file
if --stream is specified, captures to a stream
if neither is specified, plays to default output
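
One way to map this usage onto the three output modes is sketched here. This is not the code from say.zip: `Destination`, `Options`, and `ParseArgs` are hypothetical names, and the argument list is assumed to exclude the program name.

```cpp
#include <string>
#include <vector>

// Where the synthesized audio should go. Names are illustrative only.
enum class Destination { Play, File, Stream };

struct Options {
    bool valid = false;                           // false means: print usage
    std::wstring phrase;                          // text to speak
    Destination destination = Destination::Play;  // default output device
    std::wstring filename;                        // only used with --file
};

// args holds the command-line arguments without the program name.
Options ParseArgs(const std::vector<std::wstring>& args) {
    Options opts;
    if (args.empty()) {
        return opts;  // no phrase given
    }
    opts.phrase = args[0];

    if (args.size() == 1) {
        opts.valid = true;  // play to the default output
    } else if (args.size() == 2 && args[1] == L"--stream") {
        opts.destination = Destination::Stream;
        opts.valid = true;
    } else if (args.size() == 3 && args[1] == L"--file") {
        opts.destination = Destination::File;
        opts.filename = args[2];
        opts.valid = true;
    }
    return opts;  // anything else leaves valid == false
}
```

A scoped enum also keeps names like `File` from leaking into the surrounding scope.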

Here's how to generate a .wav file (uh.wav attached):

>say.exe "uh" --file uh.wav
Stream is 1

And here's how to generate an output stream. The app consumes this and prints the INT16 sample values to the console. uh.txt attached.

>say.exe "uh" --stream
Stream is 1
       0        0;        0        0;        0        0;        0        0
       0        0;        0        0;        0        0;        0        0
...
      86       86;    -1052    -1052;    -2839    -2839;    -3774    -3774
   -4199    -4199;    -4581    -4581;    -4284    -4284;    -3640    -3640
   -3100    -3100;    -2011    -2011;     -393     -393;      533      533
...
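
To make the listing concrete, here is a rough sketch of turning captured bytes into INT16 values and printing them in a similar layout. It assumes the stream holds little-endian 16-bit PCM with two interleaved channels; the identical left and right columns above suggest this, but the post does not state the format. `DecodeInt16` and `FormatStereo` are hypothetical helpers, not functions from say.zip.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Decode little-endian 16-bit samples from raw bytes.
// A trailing odd byte, if any, is ignored.
std::vector<std::int16_t> DecodeInt16(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::int16_t> samples;
    samples.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const std::uint16_t u =
            static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8));
        samples.push_back(static_cast<std::int16_t>(u));
    }
    return samples;
}

// Format interleaved stereo samples as "left right" pairs separated by
// "; ", with framesPerLine pairs on each line.
std::string FormatStereo(const std::vector<std::int16_t>& samples,
                         std::size_t framesPerLine = 4) {
    std::string out;
    char buf[32];
    const std::size_t frames = samples.size() / 2;
    for (std::size_t f = 0; f < frames; ++f) {
        std::snprintf(buf, sizeof(buf), "%8d %8d",
                      samples[2 * f], samples[2 * f + 1]);
        out += buf;
        const bool endOfLine =
            (f + 1) % framesPerLine == 0 || f + 1 == frames;
        out += endOfLine ? "\n" : "; ";
    }
    return out;
}
```

With these assumptions, the bytes `56 00 56 00 E4 FB E4 FB` decode to the frames (86, 86) and (-1052, -1052).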

Attachment: say.zip
Comments:
  • Looks great!

  • Great article!

  • If I had to be nitpicky, I would say, however, not to use "where" as a variable name as it is a keyword.
