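To generate subtitles for my videos, I converted them to audio files and used Cloud Speech-to-Text. It works, but it only produces a transcript, whereas what I need is an *.srt, *.vtt, or similar file.
What I need is what YouTube does: generate the subtitles and sync them with the video, as a subtitle format does, i.e. each piece of text together with the time at which it appears.
Although I could upload the videos to YouTube and then download its auto-generated captions, that doesn't seem quite right.
Is it possible to generate an SRT file (or similar) using Google Cloud Speech?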
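Answer 0 (score: 4)
There is no way to do this directly with the Speech-to-Text API. What you can try is some post-processing of the speech recognition result.
For example, here is a request to the REST API using a model meant to transcribe video, with a public sample file provided by Google: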
curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
    https://speech.googleapis.com/v1p1beta1/speech:longrunningrecognize \
    --data "{
  'config': {
    'encoding': 'LINEAR16',
    'sampleRateHertz': 16000,
    'languageCode': 'en-US',
    'enableWordTimeOffsets': true,
    'enableAutomaticPunctuation': true,
    'model': 'video'
  },
  'audio': {
    'uri':'gs://cloud-samples-tests/speech/Google_Gnome.wav'
  }
}"
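The above uses asynchronous recognition (speech:longrunningrecognize), which is a better fit for larger files. Enabling punctuation ('enableAutomaticPunctuation': true) in combination with the start and end times of the words ('enableWordTimeOffsets': true) near the start and end of each sentence (which you would also have to convert from nanos to timestamps) could allow you to produce a text file in the srt format. You would probably also have to include some rules about the maximum length of a sentence appearing on the screen at any given time.
Implementing the above shouldn't be too hard; however, you will most likely still run into timing/synchronization issues.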
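The post-processing described above could be sketched roughly as follows in Python. This is only a minimal sketch under some assumptions: the word list has already been extracted from the parsed JSON response, each word is a dict with word, startTime and endTime keys, and the offsets are duration strings such as "1.300s". The names to_timestamp, words_to_srt and max_chars are made up for illustration, and the rule "close a cue at sentence punctuation or at a maximum character count" is just one possible choice.

```python
def to_timestamp(offset):
    """Convert a duration string such as '1.300s' to an SRT timestamp."""
    millis = round(float(offset.rstrip("s")) * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{millis:03}"


def words_to_srt(words, max_chars=42):
    """Group timed words into SRT cues.

    A cue is closed when a word ends with sentence punctuation, or when
    adding the next word would push the cue past max_chars characters.
    """
    cues = []
    current = []
    for word in words:
        text = " ".join(w["word"] for w in current)
        if current and len(text) + 1 + len(word["word"]) > max_chars:
            cues.append(current)
            current = []
        current.append(word)
        if word["word"].endswith((".", "?", "!")):
            cues.append(current)
            current = []
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = to_timestamp(cue[0]["startTime"])
        end = to_timestamp(cue[-1]["endTime"])
        text = " ".join(w["word"] for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a single blank line.
    return "\n".join(blocks)
```

Each cue takes its start time from its first word and its end time from its last word, so any timing drift in the recognition result carries straight through to the subtitles.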
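Answer 1 (score: 2)
There is no way to do this using Google Cloud itself, but, as suggested, you can post-process the result.
In this file I wrote some quick code that does the job. You may want to adapt it to your needs: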
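// Convert a longrunningrecognize JSON response (passed as a string)
// into SRT text, with one cue per recognition result.
function convertGSTTToSRT(string) {
    const obj = JSON.parse(string);
    let i = 1;
    let result = '';
    for (const line of obj.response.results) {
        const alternative = line.alternatives[0];
        // Skip results that carry no word timings.
        if (!alternative || !alternative.words || alternative.words.length === 0) continue;
        const first = alternative.words[0];
        const last = alternative.words[alternative.words.length - 1];
        result += i++ + '\n';
        result += formatTime(convertSecondStringToRealtime(first.startTime)) + ' --> ';
        result += formatTime(convertSecondStringToRealtime(last.endTime)) + '\n';
        result += alternative.transcript.trim() + '\n\n';
    }
    return result;
}

// Format { hours, minutes, seconds, millis } as HH:MM:SS,mmm.
function formatTime(time) {
    return String(time.hours).padStart(2, '0') + ':' + String(time.minutes).padStart(2, '0') + ':' +
        String(time.seconds).padStart(2, '0') + ',' + String(time.millis).padStart(3, '0');
}

// Parse a duration string such as "12.300s" into its components,
// keeping the fractional part as milliseconds.
function convertSecondStringToRealtime(string) {
    const totalMillis = Math.round(parseFloat(string.substring(0, string.length - 1)) * 1000);
    const hours = Math.floor(totalMillis / 3600000);
    const minutes = Math.floor(totalMillis % 3600000 / 60000);
    const seconds = Math.floor(totalMillis % 60000 / 1000);
    const millis = totalMillis % 1000;
    return {
        hours, minutes, seconds, millis
    };
}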
Answer 2 (score: 0)
Use the request parameter "enable_word_time_offsets: True" to get the timestamps for the word groups. Then create the srt programmatically.
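As a rough illustration of "create the srt programmatically", here is a minimal Python sketch. It assumes the word offsets have already been collected as (word, start, end) tuples whose times are datetime.timedelta values; how you obtain them depends on the client you use. The names srt_time, build_srt and words_per_cue are hypothetical, and cutting a cue after a fixed number of words is only one simple grouping rule.

```python
from datetime import timedelta


def srt_time(offset):
    """Format a timedelta as an SRT timestamp (HH:MM:SS,mmm)."""
    total_millis = offset // timedelta(milliseconds=1)
    hours, rest = divmod(total_millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{millis:03}"


def build_srt(timed_words, words_per_cue=7):
    """Build SRT text from (word, start, end) tuples.

    Every cue holds at most words_per_cue words and spans from the start
    of its first word to the end of its last word.
    """
    blocks = []
    starts = range(0, len(timed_words), words_per_cue)
    for number, i in enumerate(starts, start=1):
        group = timed_words[i:i + words_per_cue]
        text = " ".join(word for word, _, _ in group)
        begin = srt_time(group[0][1])
        finish = srt_time(group[-1][2])
        blocks.append(f"{number}\n{begin} --> {finish}\n{text}\n")
    return "\n".join(blocks)
```

The returned string can be written to a .srt file as UTF-8 text.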
Answer 3 (score: 0)
Here is the code I use:
import json


def to_hms(s):
    # Format whole seconds as HH:MM:SS.
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return '{:0>2}:{:0>2}:{:0>2}'.format(h, m, s)


def split_time(value):
    # Split a duration string such as '12.3s' into ['12', '300'].
    parts = value.replace('s', '').split('.')
    if len(parts) == 1:
        parts.append('000')
    parts[1] = parts[1][:3].ljust(3, '0')
    return parts


def srt_entry(index, stime, etime, text):
    # Build one SRT cue from split start and end times.
    return '{}\n{},{} --> {},{}\n{}\n\n'.format(
        index, to_hms(int(stime[0])), stime[1], to_hms(int(etime[0])), etime[1], text)


def srt_generation(filepath, filename):
    # Reads <filepath><filename>.json, expected to hold a response with
    # annotationResults[0].speechTranscriptions, and writes two .srt files.
    with open('{}{}.json'.format(filepath, filename), 'r') as file:
        results = json.load(file)['response']['annotationResults'][0]['speechTranscriptions']
    counter = 1
    lines = []
    wordlist = []
    for transcription in results:
        alternative = transcription['alternatives'][0]
        if 'transcript' in alternative and alternative.get('words'):
            stime = split_time(alternative['words'][0]['startTime'])
            etime = split_time(alternative['words'][-1]['endTime'])
            lines.append(srt_entry(counter, stime, etime, alternative['transcript'].strip()))
            counter += 1
            wordlist.extend(alternative['words'])
    with open('{}{}.srt'.format(filepath, filename), 'w', encoding='utf-8') as srtfile:
        srtfile.writelines(lines)

    # Now generate chunks of roughly 3 seconds from those words.
    lines = []
    words = []
    standard_duration = 3
    srtcounter = 1
    strtime = entime = None
    for word in wordlist:
        stime = split_time(word['startTime'])
        etime = split_time(word['endTime'])
        if words and int(etime[0]) - int(strtime[0]) > standard_duration:
            # Flush the current chunk; this word starts the next one.
            lines.append(srt_entry(srtcounter, strtime, entime, ' '.join(words)))
            srtcounter += 1
            words = []
        if not words:
            strtime = stime
        words.append(word['word'])
        entime = etime
    if words:
        lines.append(srt_entry(srtcounter, strtime, entime, ' '.join(words)))
    with open('{}{}_3_Sec_Custom.srt'.format(filepath, filename), 'w', encoding='utf-8') as srtfile:
        srtfile.writelines(lines)