Question

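Just checking to make sure this is supposed to be supported. The page here says you should be able to use any PCM file that is at least 16 kHz. I am trying to split longer wav files into utterances using NAudio, and I can generate the files, but all of the training data I submit comes back with the processing error "Only RIFF (WAV) format is accepted. Please check the format of the audio file." The audio files are 16-bit PCM, mono, 44 kHz wav files, and all of them are under 60 s. Are there other restrictions on the file format that I might be missing? The wav files do have a valid RIFF header (I verified that the bytes are present).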
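Since the error is about the container rather than the audio itself, it can help to look past the first four bytes and print what the header actually declares. The following is a minimal sketch, not part of NAudio or the speech service: `WavHeaderInspector` and `Describe` are hypothetical names, and it assumes a seekable stream holding a standard RIFF/WAVE file. Which of these fields the service's validator inspects is not documented in this thread, so treat the output as a diagnostic aid only.

```csharp
using System;
using System.IO;
using System.Text;

public static class WavHeaderInspector
{
    // Hypothetical helper: summarizes the RIFF/WAVE header of a seekable stream,
    // or throws InvalidDataException when the basic layout is not as expected.
    public static string Describe(Stream stream)
    {
        using (var reader = new BinaryReader(stream, Encoding.ASCII, leaveOpen: true))
        {
            string riff = Encoding.ASCII.GetString(reader.ReadBytes(4));
            uint riffSize = reader.ReadUInt32();   // declared size of everything after this field
            string wave = Encoding.ASCII.GetString(reader.ReadBytes(4));
            if (riff != "RIFF" || wave != "WAVE")
                throw new InvalidDataException("Not a RIFF/WAVE file.");

            // Walk the chunks until "fmt " is found; other chunks may precede it.
            while (stream.Position + 8 <= stream.Length)
            {
                string chunkId = Encoding.ASCII.GetString(reader.ReadBytes(4));
                uint chunkSize = reader.ReadUInt32();
                if (chunkId == "fmt ")
                {
                    ushort formatTag = reader.ReadUInt16();     // 1 means PCM
                    ushort channels = reader.ReadUInt16();
                    uint sampleRate = reader.ReadUInt32();
                    reader.ReadUInt32();                        // average bytes per second
                    reader.ReadUInt16();                        // block align
                    ushort bitsPerSample = reader.ReadUInt16();
                    return $"riffSize={riffSize} formatTag={formatTag} channels={channels} " +
                           $"sampleRate={sampleRate} bits={bitsPerSample} fmtSize={chunkSize}";
                }
                // Chunks are padded to an even number of bytes.
                stream.Seek(chunkSize + (chunkSize & 1), SeekOrigin.Current);
            }
            throw new InvalidDataException("No fmt chunk found.");
        }
    }
}
```

Calling `WavHeaderInspector.Describe(File.OpenRead(somePath))` on one rejected segment and on one known-good file (where `somePath` is whatever file you want to check) makes differences such as a non-PCM format tag or a size field that does not match the file length easy to spot.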

Answer 1

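I managed to get around this by explicitly re-encoding the audio received from the SpeechRecognizer. It is definitely not an efficient solution, but it is just a means of testing things out. Here is the code for reference (put it in Recognizer.Recognized):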

string rawResult = ea.Result.ToString();  //can get access to raw value this way.
Regex r = new Regex(@".*Offset"":(\d*),.*");
UInt64 offset = Convert.ToUInt64(r?.Match(rawResult)?.Groups[1]?.Value);
r = new Regex(@".*Duration"":(\d*),.*");
UInt64 duration = Convert.ToUInt64(r?.Match(rawResult)?.Groups[1]?.Value);

//create segment files
File.AppendAllText($@"{path}\{fileName}\{fileName}.txt", $"{segmentNumber}\t{ea.Result.Text}\r\n");

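//offset and duration are in 100ns units
//v (defined outside this snippet) is the source WAV passed to WaveFileReader
string wavFileName = $@"{path}\{fileName}\{segmentNumber}.wav";
string tempFileName = wavFileName + ".tmp";
using (WaveFileReader w = new WaveFileReader(v))
{
    long totalDurationInMs = w.SampleCount * 1000 / w.WaveFormat.SampleRate;  //total length of the file; multiply first to avoid truncating to whole seconds
    long offsetInMs = (long)(offset / 10000);  //convert from 100ns intervals to ms
    long durationInMs = (long)(duration / 10000);
    int bytesPerSecond = w.WaveFormat.AverageBytesPerSecond;
    int blockAlign = w.WaveFormat.BlockAlign;
    //multiply before dividing so the byte rate is not truncated (88200 / 1000 would give 88),
    //then round down to a whole block so reads start and end on a sample boundary
    long startByte = bytesPerSecond * offsetInMs / 1000;
    w.Position = startByte - startByte % blockAlign;
    long bytesToRead = bytesPerSecond * durationInMs / 1000;
    bytesToRead -= bytesToRead % blockAlign;
    byte[] buffer = new byte[bytesToRead];
    int bytesRead = w.Read(buffer, 0, (int)bytesToRead);
    using (WaveFileWriter wr = new WaveFileWriter(tempFileName, w.WaveFormat))
    {
        wr.Write(buffer, 0, bytesRead);
    }
}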

//this is probably really inefficient, but it's also the simplest way to get things in the right format.  It's a prototype-deal with it...
//from other project
var desiredOutputFormat = new WaveFormat(16000, 16, 1);  //16 kHz, 16-bit, mono
using (WaveFileReader r2 = new WaveFileReader(tempFileName))
using (var converter = new WaveFormatConversionStream(desiredOutputFormat, r2))
{
    WaveFileWriter.CreateWaveFile(wavFileName, converter);
}
File.Delete(tempFileName);  //the intermediate file is not needed once the reader is closed

segmentNumber++;

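This splits the input file into separate per-turn files and appends each turn's transcript, keyed by the segment's file name, to a text file.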
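The byte positions for each segment come from offsets and durations in 100 ns units, so the conversion is sensitive to integer rounding. As a rough sketch of that arithmetic on its own (the `SegmentMath` class and `TicksToAlignedBytes` method are hypothetical names, not part of NAudio or the speech SDK), the conversion can go straight from ticks to a block-aligned byte offset:

```csharp
using System;

public static class SegmentMath
{
    // Converts a time in 100 ns ticks to a byte offset into PCM data,
    // rounded down to a whole block (one sample frame across all channels).
    public static long TicksToAlignedBytes(ulong ticks, int averageBytesPerSecond, int blockAlign)
    {
        // There are 10,000,000 ticks per second. Multiplying first keeps the
        // fractional part of the byte rate that an early division would drop.
        long bytes = (long)(ticks * (ulong)averageBytesPerSecond / 10_000_000UL);
        return bytes - bytes % blockAlign;
    }
}
```

For 16-bit mono audio at 44.1 kHz (88,200 bytes per second, block align 2), an offset of 123,450,000 ticks (12.345 s) maps to byte 1,088,828. Rounding the byte rate down to 88 bytes per millisecond first would give 1,086,360 instead, about 28 ms early, and the gap grows with the offset.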

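The good news is that this produced a "valid" dataset, and I was able to create a voice from it. The bad news is that the audio produced by the voice font is almost completely unintelligible, which I attribute to the combination of machine-transcribed samples, irregular turn breaks, and possibly noisy audio. I may see whether I can improve the accuracy by manually editing some of the files, but I at least wanted to post an answer here in case anyone else runs into the same problem.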

Also, it appears that both 16 kHz and 44 kHz PCM work with Custom Voice, so if you have higher-quality audio available, that is a plus.

MS Cognitive Custom Voice: submitting sample data returns "Only RIFF (WAV) format is accepted. Please check the format of the audio file."
