I want to combine two requests to the Google Cloud Text-to-Speech API into a single MP3 output. The reason I need to combine two requests is that the output should contain two different languages.

The code below works for many language-pair combinations, but unfortunately not for all of them. If I request one sentence in English and one in German and combine them, everything works. If I request one in English and one in Japanese, the two files cannot be combined into a single output: the output contains only the first sentence, followed by silence instead of the second one.

I have tried multiple ways of combining the two outputs by now, but the result stays the same. The code below should demonstrate the problem.

Please first run the code with:

python synthesize_bug.py --t1 'Hallo' --code1 de-DE --t2 'August' --code2 de-DE

This works fine.

python synthesize_bug.py --t1 'Hallo' --code1 de-DE --t2 'こんにちは' --code2 ja-JP

This does not work. The individual files are fine, but the combined file contains silence instead of the Japanese part. On the other hand, if it is run with two Japanese sentences, everything works.

I have already filed a bug report with Google but received no response; then again, maybe I am just making a wrong assumption about the encoding. Hopefully someone has an idea.
#!/usr/bin/env python
import argparse


# [START tts_synthesize_text_file]
def synthesize_text_file(text1, text2, code1, code2):
    """Synthesizes the two input texts and concatenates the MP3 results."""
    import base64

    from apiclient.discovery import build

    service = build('texttospeech', 'v1beta1')
    collection = service.text()

    # First request: a 2-second pause to place between the two sentences.
    data1 = {}
    data1['input'] = {}
    data1['input']['ssml'] = '<speak><break time="2s"/></speak>'
    data1['voice'] = {}
    data1['voice']['ssmlGender'] = 'FEMALE'
    data1['voice']['languageCode'] = code1
    data1['audioConfig'] = {}
    data1['audioConfig']['speakingRate'] = 0.8
    data1['audioConfig']['audioEncoding'] = 'MP3'
    request = collection.synthesize(body=data1)
    response = request.execute()
    # audioContent is base64-encoded; keep both the decoded and raw forms.
    audio_pause = base64.b64decode(response['audioContent'])
    raw_pause = response['audioContent']

    # Second request: the first sentence in the first language.
    ssmlLine = '<speak>' + text1 + '</speak>'
    data1 = {}
    data1['input'] = {}
    data1['input']['ssml'] = ssmlLine
    data1['voice'] = {}
    data1['voice']['ssmlGender'] = 'FEMALE'
    data1['voice']['languageCode'] = code1
    data1['audioConfig'] = {}
    data1['audioConfig']['speakingRate'] = 0.8
    data1['audioConfig']['audioEncoding'] = 'MP3'
    request = collection.synthesize(body=data1)
    response = request.execute()
    # The decoded audio content is binary MP3 data.
    with open('output1.mp3', 'wb') as out:
        out.write(base64.b64decode(response['audioContent']))
    print('Audio content written to file "output1.mp3"')
    audio_text1 = base64.b64decode(response['audioContent'])
    raw_text1 = response['audioContent']

    # Third request: the second sentence in the second language.
    ssmlLine = '<speak>' + text2 + '</speak>'
    data2 = {}
    data2['input'] = {}
    data2['input']['ssml'] = ssmlLine
    data2['voice'] = {}
    data2['voice']['ssmlGender'] = 'MALE'
    data2['voice']['languageCode'] = code2  # e.g. 'ko-KR'
    data2['audioConfig'] = {}
    data2['audioConfig']['speakingRate'] = 0.8
    data2['audioConfig']['audioEncoding'] = 'MP3'
    request = collection.synthesize(body=data2)
    response = request.execute()
    # The decoded audio content is binary MP3 data.
    with open('output2.mp3', 'wb') as out:
        out.write(base64.b64decode(response['audioContent']))
    print('Audio content written to file "output2.mp3"')
    audio_text2 = base64.b64decode(response['audioContent'])
    raw_text2 = response['audioContent']

    # Attempt 1: concatenate the decoded MP3 bytes.
    result = audio_text1 + audio_pause + audio_text2
    with open('result.mp3', 'wb') as out:
        out.write(result)
    print('Audio content written to file "result.mp3"')

    # Attempt 2: concatenate the base64 strings, then decode once.
    raw_result = raw_text1 + raw_pause + raw_text2
    with open('raw_result.mp3', 'wb') as out:
        out.write(base64.b64decode(raw_result))
    print('Audio content written to file "raw_result.mp3"')
# [END tts_synthesize_text_file]
if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('--t1')
    parser.add_argument('--code1')
    parser.add_argument('--t2')
    parser.add_argument('--code2')
    args = parser.parse_args()

    synthesize_text_file(args.t1, args.t2, args.code1, args.code2)
Answer 0 (score: 0)
You can find the answer here: https://issuetracker.google.com/issues/120687867

Short answer: it is still unclear why this doesn't work, but Google suggests a workaround: first write the files as .wav, merge them, and then re-encode the result to mp3.
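For illustration, here is a minimal sketch of that workaround (my own addition, not part of the linked answer). It assumes the pydub package and ffmpeg are installed, and the helper name synthesize_wav is made up for this example: each part is requested as LINEAR16 (WAV), the decoded segments are merged with a pause, and the merged audio is encoded to MP3 once at the end.

import base64

from apiclient.discovery import build
from pydub import AudioSegment


def synthesize_wav(text, code, gender='FEMALE'):
    # Same request as in the question, but asking for uncompressed
    # WAV (LINEAR16) instead of MP3.
    service = build('texttospeech', 'v1beta1')
    body = {
        'input': {'ssml': '<speak>' + text + '</speak>'},
        'voice': {'ssmlGender': gender, 'languageCode': code},
        'audioConfig': {'speakingRate': 0.8, 'audioEncoding': 'LINEAR16'},
    }
    response = service.text().synthesize(body=body).execute()
    return base64.b64decode(response['audioContent'])


with open('part1.wav', 'wb') as f:
    f.write(synthesize_wav('Hallo', 'de-DE'))
with open('part2.wav', 'wb') as f:
    f.write(synthesize_wav('こんにちは', 'ja-JP', gender='MALE'))

# Decode both WAV files, insert a 2-second pause, and encode the
# concatenated audio to MP3 in a single pass.
merged = (AudioSegment.from_wav('part1.wav')
          + AudioSegment.silent(duration=2000)
          + AudioSegment.from_wav('part2.wav'))
merged.export('result.mp3', format='mp3')

The point of the workaround is that the concatenation happens on decoded audio rather than on independently encoded MP3 streams.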
Answer 1 (score: 0)
I managed to do this in NodeJS with just one function (no idea how optimized it is, but at least it works). Maybe you can draw some inspiration from it. I used the memory-streams dependency from npm.
var streams = require('memory-streams');

function mergeAudios(audios) {
    var reader = new streams.ReadableStream();
    var writer = new streams.WritableStream();
    audios.forEach(element => {
        if (element instanceof streams.ReadableStream) {
            element.pipe(writer)
        } else {
            writer.write(element)
        }
    });
    reader.append(writer.toBuffer())
    return reader
}
The input parameter is a list whose elements are either ReadableStreams or the response.audioContent from the synthesizeSpeech operation. If an element is a readable stream, the pipe operation is used; if it is audio content, the write method is used. At the end, everything is appended into a single readable stream.
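For comparison, here is a rough Python equivalent of the same idea (my own sketch, not part of this answer), with io.BytesIO standing in for the memory-streams buffers: stream-like elements are drained into the buffer, raw byte strings are written directly, and the filled buffer is rewound for reading.

import io


def merge_audios(audios):
    # Concatenate a mix of file-like objects and raw byte strings into
    # one in-memory buffer, mirroring the NodeJS helper above.
    writer = io.BytesIO()
    for element in audios:
        if hasattr(element, 'read'):
            # Stream-like element: drain its contents into the buffer.
            writer.write(element.read())
        else:
            # Raw bytes (e.g. decoded audio content): write directly.
            writer.write(element)
    writer.seek(0)  # rewind so the caller can read from the start
    return writer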