Combining two TTS outputs into a single mp3 file does not work

Asked: 2018-12-14 21:21:52

Tags: text-to-speech, google-text-to-speech

I want to combine two requests to the Google Cloud Text-to-Speech API into a single mp3 output. The reason I need to combine two requests is that the output should contain two different languages.

The code below works for many language-pair combinations, but unfortunately not for all of them. If I request one sentence in English and one in German and combine them, everything works. If I request one in English and one in Japanese, the two files cannot be merged into a single output: the output contains only the first sentence, followed by silence where the second sentence should be.

I have tried several ways of combining the two outputs by now, but the result stays the same. The code below should demonstrate the problem.

First, run the code with:

python synthesize_bug.py --t1 'Hallo' --code1 de-De --t2 'August' --code2 de-De

This works fine. Now run:

python synthesize_bug.py --t1 'Hallo' --code1 de-De --t2 'こんにちは' --code2 ja-JP

This does not work. The individual files are fine, but the merged file contains silence instead of the Japanese part. On the other hand, if it is run with two Japanese sentences, everything works.

I have already filed a bug report with Google and received no response, but maybe I am simply doing something wrong in my assumptions about the encoding. Hopefully someone has an idea.

#!/usr/bin/env python

import argparse

# [START tts_synthesize_text_file]
def synthesize_text_file(text1, text2, code1, code2):
    """Synthesizes speech from the input file of text."""
    from apiclient.discovery import build
    import base64

    service = build('texttospeech', 'v1beta1')
    collection = service.text()

    # First request: synthesize a 2 second SSML pause to place between the two sentences.
    data1 = {}
    data1['input'] = {}
    data1['input']['ssml'] = '<speak><break time="2s"/></speak>'
    data1['voice'] = {}
    data1['voice']['ssmlGender'] = 'FEMALE'
    data1['voice']['languageCode'] = code1
    data1['audioConfig'] = {}
    data1['audioConfig']['speakingRate'] = 0.8
    data1['audioConfig']['audioEncoding'] = 'MP3'

    request = collection.synthesize(body=data1)
    response = request.execute() 
    audio_pause = base64.b64decode(response['audioContent'].decode('UTF-8'))
    raw_pause = response['audioContent']

    # Second request: the first sentence, synthesized with the first language code.
    ssmlLine = '<speak>' + text1 + '</speak>'

    data1 = {}
    data1['input'] = {}
    data1['input']['ssml'] = ssmlLine
    data1['voice'] = {}
    data1['voice']['ssmlGender'] = 'FEMALE'
    data1['voice']['languageCode'] = code1
    data1['audioConfig'] = {}
    data1['audioConfig']['speakingRate'] = 0.8
    data1['audioConfig']['audioEncoding'] = 'MP3'

    request = collection.synthesize(body=data1)
    response = request.execute() 

    # The response's audio_content is binary.
    with open('output1.mp3', 'wb') as out:
        out.write(base64.b64decode(response['audioContent'].decode('UTF-8')))
        print('Audio content written to file "output1.mp3"')

        audio_text1 = base64.b64decode(response['audioContent'].decode('UTF-8'))
        raw_text1 = response['audioContent']

    # Third request: the second sentence, synthesized with the second language code.
    ssmlLine = '<speak>' + text2 + '</speak>'

    data2 = {}
    data2['input'] = {}
    data2['input']['ssml'] = ssmlLine
    data2['voice'] = {}
    data2['voice']['ssmlGender'] = 'MALE'
    data2['voice']['languageCode'] = code2 #'ko-KR'
    data2['audioConfig'] = {}
    data2['audioConfig']['speakingRate'] = 0.8
    data2['audioConfig']['audioEncoding'] = 'MP3'

    request = collection.synthesize(body=data2)
    response = request.execute() 

    # The response's audio_content is binary.
    with open('output2.mp3', 'wb') as out:
        out.write(base64.b64decode(response['audioContent'].decode('UTF-8')))
        print('Audio content written to file "output2.mp3"')

    audio_text2 = base64.b64decode(response['audioContent'].decode('UTF-8'))
    raw_text2 = response['audioContent']

    # Concatenate the decoded mp3 payloads: sentence 1, pause, sentence 2.
    result = audio_text1 + audio_pause + audio_text2
    with open('result.mp3', 'wb') as out:
        out.write(result)
    print('Audio content written to file "result.mp3"')

    # Same concatenation on the base64 strings, decoded once at the end.
    raw_result = raw_text1 + raw_pause + raw_text2
    with open('raw_result.mp3', 'wb') as out:
        out.write(base64.b64decode(raw_result.decode('UTF-8')))
    print('Audio content written to file "raw_result.mp3"')
# [END tts_synthesize_text_file]



if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('--t1')
    parser.add_argument('--code1')
    parser.add_argument('--t2')
    parser.add_argument('--code2')
    args = parser.parse_args()

    synthesize_text_file(args.t1, args.t2, args.code1, args.code2)

2 Answers:

Answer 0 (score: 0)

You can find the answer here: https://issuetracker.google.com/issues/120687867

Short answer: it is not yet clear why this does not work, but Google suggests a workaround: write the files as .wav first, merge them, and then re-encode the result to mp3.
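One way that workaround could look is sketched below; this is my own illustration, not code from the issue tracker. It reuses the discovery-based client from the question, requests LINEAR16 audio (which the API returns with a WAV header), joins the clips with pydub, and re-encodes the result as mp3. The synthesize_wav helper name, the pydub dependency, and the need for ffmpeg are assumptions on my part.

# Sketch of the suggested workaround: synthesize both parts as LINEAR16 (WAV),
# merge them with pydub, then re-encode the merged audio to mp3.
# Assumes pydub is installed (pip install pydub) and ffmpeg is on the PATH.
import base64
import io

from apiclient.discovery import build
from pydub import AudioSegment


def synthesize_wav(text, code, gender):
    """Synthesizes one sentence and returns the raw WAV bytes."""
    service = build('texttospeech', 'v1beta1')
    body = {
        'input': {'ssml': '<speak>' + text + '</speak>'},
        'voice': {'languageCode': code, 'ssmlGender': gender},
        'audioConfig': {'speakingRate': 0.8, 'audioEncoding': 'LINEAR16'},
    }
    response = service.text().synthesize(body=body).execute()
    # LINEAR16 responses already contain a WAV header, so the decoded
    # bytes can be loaded directly as a .wav file.
    return base64.b64decode(response['audioContent'])


wav1 = synthesize_wav('Hallo', 'de-DE', 'FEMALE')
wav2 = synthesize_wav('こんにちは', 'ja-JP', 'MALE')

# pydub converts the segments to a common sample rate before joining them,
# and ffmpeg does the final mp3 encoding on export.
part1 = AudioSegment.from_wav(io.BytesIO(wav1))
pause = AudioSegment.silent(duration=2000)  # 2 second pause between the sentences
part2 = AudioSegment.from_wav(io.BytesIO(wav2))
(part1 + pause + part2).export('result.mp3', format='mp3')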

Answer 1 (score: 0)

I managed to do this in NodeJS with just one function (I don't know how optimized it is, but at least it works); maybe you can take some inspiration from it. I used the memory-streams dependency from npm.

var streams = require('memory-streams');

// Concatenate a list of audio chunks (in-memory readable streams or raw
// audioContent buffers) into a single readable stream.
function mergeAudios(audios) {
  var reader = new streams.ReadableStream();
  var writer = new streams.WritableStream();
  audios.forEach(element => {
    if (element instanceof streams.ReadableStream) {
      element.pipe(writer)
    }
    else {
      writer.write(element)
    }
  });
  reader.append(writer.toBuffer())
  return reader
}

The input parameter is a list whose elements are either ReadableStreams or response.audioContent values from the SynthesizeSpeech operation. If an element is a readable stream, the pipe operation is used; if it is audio content, the write method is used. At the end, everything is appended to a readable stream, which is returned.