Speech recognition of audio files in Python - position of a word in seconds

Time: 2017-11-05 12:37:04

Tags: python speech-to-text

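I have been experimenting with the Python speech recognition library https://pypi.python.org/pypi/SpeechRecognition/ to read downloaded versions of the BBC Shipping Forecast.

The clipping of these files from live radio to iPlayer is evidently automated and not very accurate, so there is usually some audio before the forecast itself begins, such as a trailer or the end of the news. I don't need to be that accurate, but I would like speech recognition to recognise the phrase "and now the shipping forecast" (or just "shipping" would do, actually) and cut the file from there.

My code so far (borrowed from an example) transcribes an audio file of the forecast and uses a formula (based on 200 words per minute) to predict where the word "shipping" occurs, but the prediction is not proving very accurate.

Is there a way to get the actual "frame" or onset in seconds that pocketsphinx itself detected for that word? I can't find anything in the documentation. Any ideas?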

from os import path

import speech_recognition as sr

AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "test_short2.wav")

# use the audio file as the audio source
r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

# recognize speech using Sphinx
try:
    print("Sphinx thinks you said ")
    returnedSpeech = str(r.recognize_sphinx(audio))

    wordsList = returnedSpeech.split()
    print(returnedSpeech)
    # 200 words per minute works out to 0.3 seconds per word
    print("predicted location of start ", float(wordsList.index("shipping")) * 0.3)

except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))
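For reference, the words-per-minute estimate described above can be isolated into a small helper. This is only a sketch of the arithmetic (60 / 200 = 0.3 seconds per word); the function name `estimate_word_offset` and its parameters are made up for illustration and are not part of the SpeechRecognition library.

```python
def estimate_word_offset(transcript, word, words_per_minute=200.0):
    """Guess when `word` is spoken, assuming a constant speaking rate.

    Returns the estimated offset in seconds, or None if the word
    does not appear in the transcript.
    """
    words = transcript.lower().split()
    if word not in words:
        return None
    seconds_per_word = 60.0 / words_per_minute  # 0.3 s at 200 wpm
    return words.index(word) * seconds_per_word


if __name__ == "__main__":
    print(estimate_word_offset("and now the shipping forecast", "shipping"))
```

Returning None instead of letting `list.index` raise makes the "keyword not recognised" case explicit, which happens whenever the recogniser mishears the word.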

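1 Answer:

Answer 0 (score: 1)

You need to use the pocketsphinx API directly for this kind of thing. It is highly recommended that you read the pocketsphinx documentation on keyword spotting.

You can spot a keyphrase as shown in the example: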

import os

# The import path may differ between pocketsphinx versions
from pocketsphinx import Decoder

# Placeholder paths (assumptions): point these at your model and audio directories
modeldir = "model"
datadir = "data"

config = Decoder.default_config()
config.set_string('-hmm', os.path.join(modeldir, 'en-us/en-us'))
config.set_string('-dict', os.path.join(modeldir, 'en-us/cmudict-en-us.dict'))
config.set_string('-keyphrase', 'shipping forecast')
config.set_float('-kws_threshold', 1e-30)

stream = open(os.path.join(datadir, "test_short2.wav"), "rb")

decoder = Decoder(config)
decoder.start_utt()
while True:
    buf = stream.read(1024)
    if buf:
        decoder.process_raw(buf, False, False)
    else:
        break
    if decoder.hyp() is not None:
        # start_frame and end_frame locate the keyphrase in decoder frames
        print([(seg.word, seg.prob, seg.start_frame, seg.end_frame) for seg in decoder.seg()])
        print("Detected keyphrase, restarting search")
        decoder.end_utt()
        decoder.start_utt()
stream.close()
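To get from the reported `start_frame` to a cut point in the file, the frame index has to be converted to seconds. The sketch below assumes the pocketsphinx default of 100 frames per second (the `-frate` setting), so verify that against your own configuration. The helper names `frame_to_seconds` and `trim_wav` are hypothetical, and the trimming uses only the standard-library `wave` module. The conversion is most straightforward for the first detection, since that utterance starts at the beginning of the stream.

```python
import wave

# Assumption: pocketsphinx default frame rate (-frate 100), i.e. 10 ms per frame
FRAMES_PER_SECOND = 100


def frame_to_seconds(frame, frames_per_second=FRAMES_PER_SECOND):
    """Convert a decoder frame index to an offset in seconds."""
    return frame / float(frames_per_second)


def trim_wav(src_path, dst_path, start_seconds):
    """Write a copy of src_path that begins at start_seconds."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Clamp so a cut point past the end yields an empty file, not an error
        start_sample = min(int(start_seconds * src.getframerate()), src.getnframes())
        src.setpos(start_sample)
        remaining = src.readframes(src.getnframes() - start_sample)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # the frame count in the header is fixed up on close
        dst.writeframes(remaining)
```

With a detected segment `seg`, something like `trim_wav("test_short2.wav", "forecast_only.wav", frame_to_seconds(seg.start_frame))` would cut the file at the start of the keyphrase; the output file name here is just an example.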