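I would like to transcribe an audio file with the Google Cloud Speech API. This simple script takes a wav file as input and transcribes it with very high accuracy.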
import os
import sys
import speech_recognition as sr

# open() does not expand "~", so resolve the home directory explicitly
base_dir = os.path.expanduser("~/Documents/speech-to-text")

with open(os.path.join(base_dir, "speech2textgoogleapi.json")) as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

name = sys.argv[1]  # wav file
r = sr.Recognizer()
all_text = []

with sr.AudioFile(name) as source:
    audio = r.record(source)
    # Transcribe audio file
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    all_text.append(text)

with open(os.path.join(base_dir, "transcript.txt"), "w") as f:
    f.write(str(all_text))
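How can I use the API to extract other meaningful information from the speech audio? Specifically, I would like a timestamp for each word, but other information (e.g. pitch, amplitude, speaker identification) would be very welcome. Thanks in advance!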
Answer 0 (score: 2)
There is actually an example of how to do this in the Speech API docs, under Using Time offsets (TimeStamps):
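Time offset (timestamp) values can be included in the response text for your recognize request. Time offset values show the beginning and the end of each spoken word that is recognized in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.
Time offsets are especially useful for analyzing longer audio files, where you may need to search for a particular word in the recognized text and locate it (seek) in the original audio. Time offsets are supported for all our recognition methods: recognize, streamingrecognize, and longrunningrecognize. See below for an example of longrunningrecognize .....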
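To make the "search and seek" use case from the quote concrete, here is a minimal sketch. It assumes the word offsets have already been pulled out of the response into plain (word, start, end) tuples measured in seconds; the function name find_word and the sample data are made up for illustration and are not part of the Speech API.

```python
def find_word(timed_words, target):
    """Return (start, end) pairs for every occurrence of target.

    timed_words is assumed to be a list of (word, start_seconds, end_seconds)
    tuples; matching is case-insensitive and exact (no punctuation handling).
    """
    target = target.lower()
    return [(start, end)
            for word, start, end in timed_words
            if word.lower() == target]


# Hypothetical sample data standing in for a real transcription result
timed_words = [("hello", 0.0, 0.4), ("world", 0.4, 0.9), ("hello", 1.5, 1.9)]
print(find_word(timed_words, "Hello"))  # [(0.0, 0.4), (1.5, 1.9)]
```

Each returned start value is the position you would seek to in the original audio.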
Here is the code sample for Python:
def transcribe_gcs_with_word_time_offsets(gcs_uri):
    """Transcribe the given audio file asynchronously and output the word time
    offsets."""
    # Note: the enums and types modules ship only with google-cloud-speech < 2.0
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types

    client = speech.SpeechClient()

    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US',
        enable_word_time_offsets=True)  # ask for per-word start/end offsets

    operation = client.long_running_recognize(config, audio)

    print('Waiting for operation to complete...')
    response = operation.result(timeout=90)

    for result in response.results:
        # The first alternative is the most likely transcription
        alternative = result.alternatives[0]
        print('Transcript: {}'.format(alternative.transcript))
        print('Confidence: {}'.format(alternative.confidence))

        for word_info in alternative.words:
            word = word_info.word
            start_time = word_info.start_time
            end_time = word_info.end_time
            # Offsets are Duration-style values: whole seconds plus nanoseconds
            print('Word: {}, start_time: {}, end_time: {}'.format(
                word,
                start_time.seconds + start_time.nanos * 1e-9,
                end_time.seconds + end_time.nanos * 1e-9))
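If you want to keep the timestamps instead of just printing them, something along these lines might work. This is only a sketch: offset_to_seconds and words_to_records are hypothetical helper names, not library functions. It assumes each offset is either a Duration-like object with seconds and nanos attributes (as in the sample above) or a datetime.timedelta, which is what I believe newer releases of the client library return.

```python
from datetime import timedelta
from types import SimpleNamespace


def offset_to_seconds(offset):
    """Convert a time offset to float seconds.

    Assumption: offset is a datetime.timedelta or an object exposing
    .seconds and .nanos, like the Duration values in the sample above.
    """
    if isinstance(offset, timedelta):
        return offset.total_seconds()
    return offset.seconds + offset.nanos * 1e-9


def words_to_records(words):
    """Turn word-info objects into plain dicts that are easy to save as JSON."""
    return [{"word": w.word,
             "start": offset_to_seconds(w.start_time),
             "end": offset_to_seconds(w.end_time)}
            for w in words]


# Stand-in objects that mimic the shape of alternative.words, for illustration
fake_words = [
    SimpleNamespace(word="hello",
                    start_time=SimpleNamespace(seconds=0, nanos=0),
                    end_time=SimpleNamespace(seconds=0, nanos=500000000)),
    SimpleNamespace(word="world",
                    start_time=timedelta(milliseconds=500),
                    end_time=timedelta(seconds=1)),
]
records = words_to_records(fake_words)
print(records)
```

In the real function you would pass alternative.words to words_to_records and write the result out with json.dump.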
Hope this helps.