Question

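In a text-to-speech application in C#, I use the SpeechSynthesizer class, which has an event named SpeakProgress that fires for every spoken word. For some voices, however, the parameter e.AudioPosition is not synchronized with the output audio stream, and the output wave file plays back faster than this position indicates (see this related question).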

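Anyway, I am trying to find exact information about the bit rate and other properties of the selected voice. In my experience, if I can initialize the wave file with this information, the synchronization problem is solved. However, if the information is not in SupportedAudioFormats, I know of no other way to find it. For example, the "Microsoft David Desktop" voice lists no supported formats in its VoiceInfo, yet it seems to support a PCM 16000 Hz, 16-bit format.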
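To make the synchronization problem concrete, here is a minimal sketch of the arithmetic involved. It is not the asker's code: it assumes uncompressed PCM, and the `WaveMath` helper and the 22050 Hz header value are hypothetical, chosen only to show one way such drift could arise when the wave header does not match the data.

```csharp
using System;

// Hypothetical helper: plain PCM arithmetic, independent of System.Speech.
static class WaveMath
{
    // Bytes of audio per second for uncompressed PCM.
    public static int BytesPerSecond(int samplesPerSec, int bitsPerSample, int channels)
    {
        return samplesPerSec * channels * (bitsPerSample / 8);
    }

    // Playback duration of a PCM payload, given the format the player assumes.
    public static TimeSpan Duration(long dataBytes, int samplesPerSec, int bitsPerSample, int channels)
    {
        double seconds = (double)dataBytes / BytesPerSecond(samplesPerSec, bitsPerSample, channels);
        return TimeSpan.FromSeconds(seconds);
    }
}

static class Program
{
    static void Main()
    {
        // Assume the voice really produced 10 s of 16 kHz, 16-bit mono audio.
        long dataBytes = 10L * WaveMath.BytesPerSecond(16000, 16, 1);

        // Header matches the data: playback lasts as long as the engine reported.
        Console.WriteLine(WaveMath.Duration(dataBytes, 16000, 16, 1).TotalSeconds);

        // Header claims 22050 Hz (hypothetical mismatch): the same bytes play faster.
        Console.WriteLine(WaveMath.Duration(dataBytes, 22050, 16, 1).TotalSeconds);
    }
}
```

If the second duration is shorter than the positions reported by SpeakProgress, the file finishes before the events say it should, which matches the symptom described above.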

How can I find the audio format of the selected voice of the SpeechSynthesizer?

var formats = CurVoice.VoiceInfo.SupportedAudioFormats;

if (formats.Count > 0)
{
    var format = formats[0];
    reader.SetOutputToWaveFile(CurAudioFile, format);
}
else
{
    var format = // How can I find it, if the audio hasn't provided it?
    reader.SetOutputToWaveFile(CurAudioFile, format);
}
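One possible stopgap for the else branch, sketched under a loud assumption: when the voice advertises nothing, fall back to 16 kHz, 16-bit mono PCM, the format the question observes for "Microsoft David Desktop". The `OutputFormatFallback` class is hypothetical, and the assumed default may be wrong for other voices.

```csharp
using System.Speech.AudioFormat;
using System.Speech.Synthesis;

// Hypothetical helper, not part of System.Speech.
static class OutputFormatFallback
{
    // Assumption: 16 kHz, 16-bit mono PCM. Other voices may need different values.
    public static SpeechAudioFormatInfo AssumedDefault()
    {
        return new SpeechAudioFormatInfo(16000, AudioBitsPerSample.Sixteen, AudioChannel.Mono);
    }

    // Uses the first advertised format if there is one, otherwise the assumed default.
    public static void SetOutput(SpeechSynthesizer reader, InstalledVoice voice, string path)
    {
        var formats = voice.VoiceInfo.SupportedAudioFormats;
        var format = formats.Count > 0 ? formats[0] : AssumedDefault();
        reader.SetOutputToWaveFile(path, format);
    }
}
```

Assuming `CurVoice` is an `InstalledVoice`, as the `.VoiceInfo` access suggests, the call would be `OutputFormatFallback.SetOutput(reader, CurVoice, CurAudioFile);`.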

Answer 1

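Update: this answer has been edited following investigation. Initially I suggested from memory that SupportedAudioFormats probably comes only from (possibly misconfigured) registry data. Investigation has shown that, for me on Windows 7, this is definitely the case, and my experience on Windows 8 backs it up.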

The problem with SupportedAudioFormats

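System.Speech wraps the venerable COM speech API (SAPI), and some voices are registered as 32-bit rather than 64-bit, or may be misconfigured (in the registry of a 64-bit machine, HKLM/Software/Microsoft/Speech/Voices vs HKLM/Software/Wow6432Node/Microsoft/Speech/Voices).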

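I've pointed ILSpy at System.Speech and its VoiceInfo class, and I'm pretty convinced that SupportedAudioFormats comes solely from registry data. So you may get zero results when enumerating SupportedAudioFormats if your TTS engine isn't properly registered for your application's platform target (x86, Any CPU, or x64), or if the vendor simply doesn't provide this information in the registry.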

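Voices may still support different, additional, or fewer formats, because that is up to the speech engine (code) rather than the registry (data). So it can be a shot in the dark. Standard Windows voices are often more consistent in this regard than third-party voices, but even they do not necessarily provide a useful SupportedAudioFormats.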
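To see what each registry view actually holds on a given machine, something like the following sketch could be used. It assumes the formats are stored in a value named "AudioFormats" under each token's Attributes subkey; the `VoiceRegistry` class is hypothetical, and the code only does something useful on Windows.

```csharp
using System;
using Microsoft.Win32;

// Hypothetical helper for inspecting SAPI voice registrations.
static class VoiceRegistry
{
    // Registry location of the voice tokens, relative to HKLM.
    const string TokensPath = @"SOFTWARE\Microsoft\Speech\Voices\Tokens";

    // Path of the Attributes subkey for one voice token, relative to HKLM.
    public static string AttributesPath(string tokenName)
    {
        return TokensPath + @"\" + tokenName + @"\Attributes";
    }

    // Prints the AudioFormats attribute of every voice token in one registry view.
    public static void Dump(RegistryView view)
    {
        using (var hklm = RegistryKey.OpenBaseKey(RegistryHive.LocalMachine, view))
        using (var tokens = hklm.OpenSubKey(TokensPath))
        {
            if (tokens == null)
            {
                Console.WriteLine(view + ": no voices key");
                return;
            }
            foreach (var name in tokens.GetSubKeyNames())
            {
                using (var attrs = hklm.OpenSubKey(AttributesPath(name)))
                {
                    // Assumption: the formats live in a value named "AudioFormats".
                    object formats = attrs == null ? null : attrs.GetValue("AudioFormats");
                    Console.WriteLine(view + " " + name + ": " + (formats ?? "(not provided)"));
                }
            }
        }
    }

    static void Main()
    {
        Dump(RegistryView.Registry64); // HKLM/Software/Microsoft/...
        Dump(RegistryView.Registry32); // HKLM/Software/Wow6432Node/Microsoft/... on 64-bit Windows
    }
}
```

A voice that appears in only one of the two views, or that prints "(not provided)", would be a candidate for the empty SupportedAudioFormats described above.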

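Finding this information the hard way

I've found that it is still possible to get the current format of the current voice, but this relies on reflection to access the internals of the System.Speech SAPI wrappers.

Consequently, this is quite fragile code! I wouldn't recommend it for production use.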

Note: the code below requires you to have called Speak() once for setup; without Speak(), more calls would be needed to force the setup. However, calling Speak("") to say nothing works just fine.

Implementation:

using System.Reflection;
using System.Runtime.InteropServices;
using System.Speech.Synthesis;

[StructLayout(LayoutKind.Sequential)]
struct WAVEFORMATEX
{
    public ushort wFormatTag;
    public ushort nChannels;
    public uint nSamplesPerSec;
    public uint nAvgBytesPerSec;
    public ushort nBlockAlign;
    public ushort wBitsPerSample;
    public ushort cbSize;
}

WAVEFORMATEX GetCurrentWaveFormat(SpeechSynthesizer synthesizer)
{
    // All three members below are internal to System.Speech and may change between versions.
    var voiceSynthesis = synthesizer.GetType()
                                    .GetProperty("VoiceSynthesizer", BindingFlags.Instance | BindingFlags.NonPublic)
                                    .GetValue(synthesizer, null);

    var ttsVoice = voiceSynthesis.GetType()
                                 .GetMethod("CurrentVoice", BindingFlags.Instance | BindingFlags.NonPublic)
                                 .Invoke(voiceSynthesis, new object[] { false });

    var waveFormat = (byte[])ttsVoice.GetType()
                                     .GetField("_waveFormat", BindingFlags.Instance | BindingFlags.NonPublic)
                                     .GetValue(ttsVoice);

    // Pin the raw bytes so they can be read as a WAVEFORMATEX, and always release the pin.
    var pin = GCHandle.Alloc(waveFormat, GCHandleType.Pinned);
    try
    {
        return (WAVEFORMATEX)Marshal.PtrToStructure(pin.AddrOfPinnedObject(), typeof(WAVEFORMATEX));
    }
    finally
    {
        pin.Free();
    }
}

Usage:

SpeechSynthesizer s = new SpeechSynthesizer();
s.Speak("Hello");
var format = GetCurrentWaveFormat(s);
Debug.WriteLine($"{s.Voice.SupportedAudioFormats.Count} formats are claimed as supported.");
Debug.WriteLine($"Actual format: {format.nChannels} channel {format.nSamplesPerSec} Hz {format.wBitsPerSample} audio");

To test it, I renamed Microsoft Anna's AudioFormats registry key under HKLM/Software/Wow6432Node/Microsoft/Speech/Voices/Tokens/MS-Anna-1033-20-Dsk/Attributes, causing SpeechSynthesizer.Voice.SupportedAudioFormats to have no elements when queried. In that situation the code above still returned the voice's actual format, even though no supported formats were claimed.
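As a possible alternative to the pinned PtrToStructure call in the implementation above, the raw bytes can be decoded field by field. This is only a sketch: the `WaveFormat` struct and `WaveFormatParser` class are hypothetical, the code assumes a little-endian host (as on Windows), and the sample bytes in `Main` are hand-built rather than taken from a real voice. One potential benefit is that it checks the array length up front, whereas marshaling a struct may read past the end of a short array.

```csharp
using System;

// Hypothetical managed mirror of the first 16 bytes of a WAVEFORMATEX.
struct WaveFormat
{
    public ushort FormatTag, Channels, BlockAlign, BitsPerSample;
    public uint SamplesPerSec, AvgBytesPerSec;
}

static class WaveFormatParser
{
    // Reads the fixed WAVEFORMATEX layout field by field (little-endian host assumed).
    public static WaveFormat Parse(byte[] raw)
    {
        if (raw == null || raw.Length < 16)
            throw new ArgumentException("Need at least 16 bytes of WAVEFORMATEX data.", "raw");

        return new WaveFormat
        {
            FormatTag = BitConverter.ToUInt16(raw, 0),
            Channels = BitConverter.ToUInt16(raw, 2),
            SamplesPerSec = BitConverter.ToUInt32(raw, 4),
            AvgBytesPerSec = BitConverter.ToUInt32(raw, 8),
            BlockAlign = BitConverter.ToUInt16(raw, 12),
            BitsPerSample = BitConverter.ToUInt16(raw, 14),
        };
    }

    static void Main()
    {
        // Hand-built sample: PCM (tag 1), mono, 16000 Hz, 32000 bytes/s, block align 2, 16 bit, cbSize 0.
        byte[] raw =
        {
            0x01, 0x00, 0x01, 0x00, 0x80, 0x3E, 0x00, 0x00,
            0x00, 0x7D, 0x00, 0x00, 0x02, 0x00, 0x10, 0x00, 0x00, 0x00
        };
        WaveFormat f = Parse(raw);
        Console.WriteLine(f.Channels + " ch, " + f.SamplesPerSec + " Hz, " + f.BitsPerSample + " bit");
        // prints "1 ch, 16000 Hz, 16 bit"
    }
}
```

In the reflection-based approach, the byte array read from the internal `_waveFormat` field would be the input to `Parse`.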

Answer 2

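You can't get this information from code. You can only listen to all the formats (from poor ones such as 8 kHz to high-quality ones such as 48 kHz) and observe where the sound stops getting any better, which is what you did.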

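Internally, the speech engine "asks" the voice for its original audio format only once. I believe this value is used only internally by the speech engine, which does not expose it in any way.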

For further information:

Suppose you are a voice company. You have recorded your computer voice at 16 kHz, 16-bit, mono.

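The user can let your voice speak at 48 kHz, 32-bit, stereo. The speech engine performs this conversion. It doesn't care whether the result really sounds better; it simply converts the format.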
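To illustrate why such a conversion adds no real quality, here is a deliberately naive sketch of going from 16 kHz mono to 48 kHz stereo by repeating samples. It is not how SAPI converts audio (a real converter would typically interpolate or filter), the `NaiveConverter` class is hypothetical, and samples stay 16-bit for brevity.

```csharp
using System;

// Hypothetical illustration only; not SAPI's actual conversion algorithm.
static class NaiveConverter
{
    // Turns 16 kHz mono samples into 48 kHz stereo by repeating each sample:
    // 3 output frames per input sample, 2 identical channels per frame.
    public static short[] Mono16kToStereo48k(short[] mono)
    {
        const int rateFactor = 48000 / 16000; // 3 output frames per input sample
        const int channels = 2;               // left and right carry the same value

        var output = new short[mono.Length * rateFactor * channels];
        int o = 0;
        foreach (short sample in mono)
        {
            for (int i = 0; i < rateFactor * channels; i++)
            {
                output[o++] = sample;
            }
        }
        return output;
    }

    static void Main()
    {
        short[] converted = Mono16kToStereo48k(new short[] { 100, -200 });
        Console.WriteLine(string.Join(",", converted));
    }
}
```

The output holds six times as many values as the input, yet every one of them is a copy of a recorded sample, so the larger format carries no extra information.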

Suppose a user wants your voice to say something. He requests that the file be saved as 48 kHz, 16-bit, stereo.

SAPI / System.Speech calls your voice with this method:

STDMETHODIMP SpeechEngine::GetOutputFormat(const GUID * pTargetFormatId, const WAVEFORMATEX * pTargetWaveFormatEx,
GUID * pDesiredFormatId, WAVEFORMATEX ** ppCoMemDesiredWaveFormatEx)
{
    HRESULT hr = S_OK;

    // Here we return the format of the audio data that we pass to the speech engine.
    // Our format (16 kHz, 16 bit, mono) will be converted by the SAPI engine to the
    // format that the user requested.

    // Tell the speech engine which format the returned data has, so it knows whether
    // to upsample or downsample your voice data to match the requested format.
    enum SPSTREAMFORMAT sample_rate_at_which_this_voice_was_recorded = SPSF_16kHz16BitMono;

    hr = SpConvertStreamFormatEnum(sample_rate_at_which_this_voice_was_recorded, pDesiredFormatId, ppCoMemDesiredWaveFormatEx);

    return hr;
}

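This is the only place where you have to "reveal" the recording format of your voice.

All the "available formats" rather tell you which conversions your sound card / Windows can perform.

I hope I explained it well. As a voice vendor, you don't support any formats. You just tell the speech engine what format your audio data is in, so that it can do the further conversions.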
