I want to write a program that automatically synchronizes out-of-sync subtitles. One of the solutions I thought of is to algorithmically detect human speech somehow and adjust the subtitles to match it. The APIs I found (Google Speech API, Yandex SpeechKit) require a server (not very convenient for me) and (probably) do a lot of unnecessary work determining exactly what was said, while I only need to know that something was said.
In other words, I want to give it an audio file and get back something like this:
[(00:12, 00:26), (01:45, 01:49) ... , (25:21, 26:11)]
Is there a solution (preferably in Python) that can find human speech and runs on a local machine?
Answer 0 (score: 7)
The technical term for what you are trying to do is Voice Activity Detection (VAD). There is a Python library called SPEAR that does it (among other things).
Answer 1 (score: 2)
You could run a window across the audio file and try to extract what fraction of the total signal's power is human voice (fundamental frequencies lie between 50 and 300 Hz). The following is meant to give intuition and is untested on real audio.
import scipy.fftpack as sf
import numpy as np

def hasHumanVoice(X, threshold, F_sample, Low_cutoff=50, High_cutoff=300):
    """ Search for the presence of vocal frequencies in a real signal using the FFT.

    Inputs
    =======
    X: 1-D numpy array, the real time-domain audio signal (single-channel time series)
    threshold: float, fraction of total power that must fall in the vocal band for the
        window to count as speech; has to be calibrated once against real vocal signals
    F_sample: float, the sampling frequency of the signal (physical frequency in units of Hz)
    Low_cutoff: float, frequency components below this frequency will not pass the filter (Hz)
    High_cutoff: float, frequency components above this frequency will not pass the filter (Hz)
    """
    M = X.size  # let M be the length of the time series
    Spectrum = sf.rfft(X, n=M)
    [Low_cutoff, High_cutoff, F_sample] = map(float, [Low_cutoff, High_cutoff, F_sample])

    # Convert cutoff frequencies into integer indices on the spectrum
    [Low_point, High_point] = map(lambda F: int(F / F_sample * M), [Low_cutoff, High_cutoff])

    # Use squared magnitudes so we compare power, not raw FFT coefficients
    Power = np.abs(Spectrum) ** 2
    totalPower = np.sum(Power)
    fractionPowerInSignal = np.sum(Power[Low_point:High_point]) / totalPower  # fraction of power in the vocal band

    if fractionPowerInSignal > threshold:
        return 1
    else:
        return 0

voiceVector = []
for window in fullAudio:  # run a window of appropriate length across the audio file
    voiceVector.append(hasHumanVoice(window, threshold, samplingRate))
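The loop above assumes fullAudio already yields fixed-length windows. A minimal sketch of producing such windows from a WAV file using scipy.io.wavfile; the file name and the 0.5-second window size are assumptions:

from scipy.io import wavfile

samplingRate, audio = wavfile.read("movie_audio.wav")  # hypothetical file name
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # mix stereo down to a single channel

windowSize = int(0.5 * samplingRate)  # 0.5-second windows, an arbitrary choice
fullAudio = (audio[i:i + windowSize]
             for i in range(0, len(audio) - windowSize + 1, windowSize))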
Answer 2 (score: 2)
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection (VAD) implementation. It does the best job of any VAD I've used at correctly classifying human speech, even in noisy audio.
To use it for your purpose, you would do something like this:
vad = webrtcvad.Vad()
vad.is_speech(chunk, sample_rate)
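For context, is_speech expects 16-bit mono PCM at 8000, 16000, 32000 or 48000 Hz, in frames of 10, 20 or 30 ms. A minimal sketch of classifying successive 30 ms frames; pcm_data is assumed to be raw PCM bytes from your own decoding step, and the aggressiveness mode 2 is an arbitrary choice:

import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; higher modes filter out more non-speech

sample_rate = 16000                        # assumed: 16 kHz, 16-bit mono PCM
frame_bytes = int(sample_rate * 0.03) * 2  # 30 ms of 2-byte samples

flags = []
for offset in range(0, len(pcm_data) - frame_bytes + 1, frame_bytes):
    chunk = pcm_data[offset:offset + frame_bytes]
    flags.append(vad.is_speech(chunk, sample_rate))  # True if the frame sounds like speech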
The VAD output can be "noisy": if it classifies a single 30-millisecond chunk of audio as speech, you don't really want to output a time for that. You probably want to look over the last 0.3 seconds (or so) of audio and check whether the majority of 30 ms chunks in that period are classified as speech. If they are, output the start time of that 0.3-second period as the start of speech. Then do something similar to detect when speech ends: wait for a 0.3-second period of audio in which the majority of 30 ms chunks are not classified as speech by the VAD, and when that happens, output its end time as the end of speech.
You may have to tweak the timings a little to get good results. Maybe you decide that you need 0.2 seconds of audio in which more than 30% of chunks are classified as speech by the VAD before you trigger, and 1.0 second of audio with more than 50% of chunks classified as non-speech before you de-trigger.
A ring buffer (collections.deque in Python) is a useful data structure for keeping track of the last N chunks of audio and their classifications, as in the sketch below.
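A minimal sketch of that smoothing pass over the per-frame flags produced above. The 0.3-second look-back and the 90% trigger/de-trigger ratios are assumed defaults, not values from the answer, and speech_segments is a hypothetical helper name:

import collections

FRAME_MS = 30  # length of each frame fed to the VAD, in milliseconds

def speech_segments(flags, window=10, trigger_ratio=0.9, detrigger_ratio=0.9):
    """Collapse per-frame VAD flags into (start_sec, end_sec) segments.

    `window` frames of 30 ms form the 0.3 s look-back discussed above;
    the two ratios are the tuning knobs mentioned in the answer.
    """
    ring = collections.deque(maxlen=window)  # ring buffer of the last N flags
    segments, triggered, start = [], False, 0.0
    for i, is_speech in enumerate(flags):
        ring.append(is_speech)
        if len(ring) < window:
            continue  # not enough history yet
        voiced = sum(ring)
        if not triggered and voiced >= trigger_ratio * window:
            # speech began roughly where the look-back buffer starts
            start = (i - window + 1) * FRAME_MS / 1000.0
            triggered = True
        elif triggered and (window - voiced) >= detrigger_ratio * window:
            segments.append((start, i * FRAME_MS / 1000.0))
            triggered = False
    if triggered:  # the audio ended mid-speech
        segments.append((start, len(flags) * FRAME_MS / 1000.0))
    return segments

# e.g. speech_segments(flags) might return [(12.3, 26.1), (105.3, 109.2), ...]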