I am doing speech-to-text with the Google Speech API and NAudio (using the NAudio WaveInEvent class), following this example: https://cloud.google.com/speech-to-text/docs/streaming-recognize?hl=en (the C# sample under "Performing streaming speech recognition on an audio stream").
Everything works fine and fast if the speaker is close to the microphone. However, if the speaker is far away from the microphone, his first 3-5 words are not recognized. After that, the remaining words are recognized well. (So distance as such is not the problem.) It looks more like an adaptation-to-distance issue, or perhaps NAudio is not recording at 100% input volume.
Any ideas on this problem?
Edit: Here is the code, as requested:
static async Task<object> StreamingMicRecognizeAsync(int seconds)
{
    if (NAudio.Wave.WaveIn.DeviceCount < 1)
    {
        Console.WriteLine("No microphone!");
        return -1;
    }
    var speech = SpeechClient.Create();
    var streamingCall = speech.StreamingRecognize();
    // Write the initial request with the config.
    await streamingCall.WriteAsync(
        new StreamingRecognizeRequest()
        {
            StreamingConfig = new StreamingRecognitionConfig()
            {
                Config = new RecognitionConfig()
                {
                    Encoding =
                        RecognitionConfig.Types.AudioEncoding.Linear16,
                    SampleRateHertz = 16000,
                    LanguageCode = "en",
                },
                InterimResults = true,
            }
        });
    // Print responses as they arrive.
    Task printResponses = Task.Run(async () =>
    {
        while (await streamingCall.ResponseStream.MoveNext(
            default(CancellationToken)))
        {
            foreach (var result in streamingCall.ResponseStream
                .Current.Results)
            {
                foreach (var alternative in result.Alternatives)
                {
                    Console.WriteLine(alternative.Transcript);
                }
            }
        }
    });
    // Read from the microphone and stream to API.
    object writeLock = new object();
    bool writeMore = true;
    var waveIn = new NAudio.Wave.WaveInEvent();
    waveIn.DeviceNumber = 0;
    waveIn.WaveFormat = new NAudio.Wave.WaveFormat(16000, 1);
    waveIn.DataAvailable +=
        (object sender, NAudio.Wave.WaveInEventArgs args) =>
        {
            lock (writeLock)
            {
                if (!writeMore) return;
                streamingCall.WriteAsync(
                    new StreamingRecognizeRequest()
                    {
                        AudioContent = Google.Protobuf.ByteString
                            .CopyFrom(args.Buffer, 0, args.BytesRecorded)
                    }).Wait();
            }
        };
    waveIn.StartRecording();
    Console.WriteLine("Speak now.");
    await Task.Delay(TimeSpan.FromSeconds(seconds));
    // Stop recording and shut down.
    waveIn.StopRecording();
    lock (writeLock) writeMore = false;
    await streamingCall.WriteCompleteAsync();
    await printResponses;
    return 0;
}
Source: https://cloud.google.com/speech-to-text/docs/streaming-recognize?hl=en
Answer 0 (score: 0)
Yes, this is how it works. The engine adapts to the sound level, and if the level is too low it simply misses the first words and only starts recognizing once it has adapted. Accuracy will also be lower than expected.
To work around this, use a more advanced microphone array that tracks the audio source (such as a ReSpeaker or Matrix board), and possibly a custom speech recognition system that is more robust to rapid changes in audio level. It would also be cheaper than the Google API.
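If replacing the hardware is not an option, one thing worth trying (a minimal sketch, not part of the original answer) is applying a fixed software gain to the 16-bit PCM buffer before streaming it, so quiet, distant speech reaches the recognizer at a higher level. The AmplifyPcm16 helper and the gain value below are hypothetical:

// Hypothetical helper: multiply 16-bit little-endian PCM samples by a
// fixed gain, clamping to the short range to avoid integer overflow.
static void AmplifyPcm16(byte[] buffer, int bytesRecorded, float gain)
{
    for (int i = 0; i + 1 < bytesRecorded; i += 2)
    {
        short sample = (short)(buffer[i] | (buffer[i + 1] << 8));
        int amplified = (int)(sample * gain);
        if (amplified > short.MaxValue) amplified = short.MaxValue;
        if (amplified < short.MinValue) amplified = short.MinValue;
        buffer[i] = (byte)(amplified & 0xFF);
        buffer[i + 1] = (byte)((amplified >> 8) & 0xFF);
    }
}

This would be called at the top of the DataAvailable handler, e.g. AmplifyPcm16(args.Buffer, args.BytesRecorded, 4.0f), before the buffer is wrapped in a StreamingRecognizeRequest. Note that a fixed gain amplifies noise as well as speech, so it is a crude substitute for a proper microphone array.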
Answer 1 (score: 0)
The Cloud Speech API has best practices for getting optimal results, which include:
"The recognizer is designed to ignore background voices and noise without additional noise-canceling. However, for optimal results, position the microphone as close to the user as possible, particularly when background noise is present."
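Regarding the question's suspicion that NAudio might not be recording at 100% input volume: WaveInEvent just captures whatever level the OS delivers, but you can inspect and raise the Windows capture-device volume through NAudio's CoreAudioApi wrapper. A minimal sketch, assuming the default communications capture endpoint is the same device as waveIn.DeviceNumber (verify this in real code):

using NAudio.CoreAudioApi;

// Raise the default capture device's input level to 100% so the
// recognizer receives the loudest signal the hardware can provide.
static void MaximizeCaptureVolume()
{
    var enumerator = new MMDeviceEnumerator();
    MMDevice mic = enumerator.GetDefaultAudioEndpoint(
        DataFlow.Capture, Role.Communications);
    mic.AudioEndpointVolume.MasterVolumeLevelScalar = 1.0f; // range 0.0-1.0
}

This only removes one variable (the OS-level input gain); it does not shorten the recognizer's adaptation period itself, which is why the advice above to keep the microphone close to the user still applies.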