I want to write a program that automatically synchronizes out-of-sync subtitles. One of the solutions I thought of is to algorithmically detect human speech somehow and adjust the subtitles to match it. The APIs I found (Google Speech API, Yandex SpeechKit) require a server (not very convenient for me) and (probably) do a lot of unnecessary work determining exactly what was said, while I only need to know that something was said.
In other words, I want to give it an audio file and get back something like this:
[(00:12, 00:26), (01:45, 01:49) ... , (25:21, 26:11)]
Is there a solution (preferably in Python) that can find human speech and runs on a local machine?
Answer 0 (score: 7)
The technical term for what you are trying to do is Voice Activity Detection (VAD). There is a Python library called SPEAR that does it (among other things).
Answer 1 (score: 2)
You could run a window across the audio file and try to extract what fraction of the total signal's power is human voice (fundamental frequencies lie between 50 and 300 Hz). The following is meant to give intuition and is untested on real audio.
import scipy.fftpack as sf
import numpy as np

def hasHumanVoice(X, threshold, F_sample, Low_cutoff=50, High_cutoff=300):
    """ Search for the presence of vocal frequencies in a real signal using the FFT.

    Inputs
    =======
    X: 1-D numpy array, the real time-domain audio signal (single-channel time series)
    threshold: float, fraction of total power that must fall in the vocal band for the
        window to count as speech; has to be calibrated once against real vocal signals
    F_sample: float, the sampling frequency of the signal (physical frequency in units of Hz)
    Low_cutoff: float, frequency components below this frequency will not pass the filter (Hz)
    High_cutoff: float, frequency components above this frequency will not pass the filter (Hz)
    """
    M = X.size  # let M be the length of the time series
    Spectrum = sf.rfft(X, n=M)
    [Low_cutoff, High_cutoff, F_sample] = map(float, [Low_cutoff, High_cutoff, F_sample])

    # Convert cutoff frequencies into integer indices on the spectrum
    [Low_point, High_point] = map(lambda F: int(F / F_sample * M), [Low_cutoff, High_cutoff])

    # Use squared magnitudes so we compare power, not raw FFT coefficients
    Power = np.abs(Spectrum) ** 2
    totalPower = np.sum(Power)
    fractionPowerInSignal = np.sum(Power[Low_point:High_point]) / totalPower  # fraction of power in the vocal band

    if fractionPowerInSignal > threshold:
        return 1
    else:
        return 0

voiceVector = []
for window in fullAudio:  # run a window of appropriate length across the audio file
    voiceVector.append(hasHumanVoice(window, threshold, samplingRate))
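The loop above assumes fullAudio already yields fixed-length windows. A minimal sketch of producing such windows from a WAV file using scipy.io.wavfile; the file name and the 0.5-second window size are assumptions:

from scipy.io import wavfile

samplingRate, audio = wavfile.read("movie_audio.wav")  # hypothetical file name
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # mix stereo down to a single channel

windowSize = int(0.5 * samplingRate)  # 0.5-second windows, an arbitrary choice
fullAudio = (audio[i:i + windowSize]
             for i in range(0, len(audio) - windowSize + 1, windowSize))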
Answer 2 (score: 2)
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection (VAD) implementation. It does the best job of any VAD I've used at correctly classifying human speech, even in noisy audio.
To use it for your purpose, you would do something like this:
vad = webrtcvad.Vad()
vad.is_speech(chunk, sample_rate)
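For context, is_speech expects 16-bit mono PCM at 8000, 16000, 32000 or 48000 Hz, in frames of 10, 20 or 30 ms. A minimal sketch of classifying successive 30 ms frames; pcm_data is assumed to be raw PCM bytes from your own decoding step, and the aggressiveness mode 2 is an arbitrary choice:

import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; higher modes filter out more non-speech

sample_rate = 16000                        # assumed: 16 kHz, 16-bit mono PCM
frame_bytes = int(sample_rate * 0.03) * 2  # 30 ms of 2-byte samples

flags = []
for offset in range(0, len(pcm_data) - frame_bytes + 1, frame_bytes):
    chunk = pcm_data[offset:offset + frame_bytes]
    flags.append(vad.is_speech(chunk, sample_rate))  # True if the frame sounds like speech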
The VAD output can be "noisy": if it classifies a single 30-millisecond chunk of audio as speech, you don't really want to output a time for that. You probably want to look over the last 0.3 seconds (or so) of audio and check whether the majority of 30 ms chunks in that period are classified as speech. If they are, output the start time of that 0.3-second period as the start of speech. Then do something similar to detect when speech ends: wait for a 0.3-second period of audio in which the majority of 30 ms chunks are not classified as speech by the VAD, and when that happens, output its end time as the end of speech.
You may have to tweak the timings a little to get good results. Maybe you decide that you need 0.2 seconds of audio in which more than 30% of chunks are classified as speech by the VAD before you trigger, and 1.0 second of audio with more than 50% of chunks classified as non-speech before you de-trigger.
A ring buffer (collections.deque in Python) is a useful data structure for keeping track of the last N chunks of audio and their classifications, as in the sketch below.
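A minimal sketch of that smoothing pass over the per-frame flags produced above. The 0.3-second look-back and the 90% trigger/de-trigger ratios are assumed defaults, not values from the answer, and speech_segments is a hypothetical helper name:

import collections

FRAME_MS = 30  # length of each frame fed to the VAD, in milliseconds

def speech_segments(flags, window=10, trigger_ratio=0.9, detrigger_ratio=0.9):
    """Collapse per-frame VAD flags into (start_sec, end_sec) segments.

    `window` frames of 30 ms form the 0.3 s look-back discussed above;
    the two ratios are the tuning knobs mentioned in the answer.
    """
    ring = collections.deque(maxlen=window)  # ring buffer of the last N flags
    segments, triggered, start = [], False, 0.0
    for i, is_speech in enumerate(flags):
        ring.append(is_speech)
        if len(ring) < window:
            continue  # not enough history yet
        voiced = sum(ring)
        if not triggered and voiced >= trigger_ratio * window:
            # speech began roughly where the look-back buffer starts
            start = (i - window + 1) * FRAME_MS / 1000.0
            triggered = True
        elif triggered and (window - voiced) >= detrigger_ratio * window:
            segments.append((start, i * FRAME_MS / 1000.0))
            triggered = False
    if triggered:  # the audio ended mid-speech
        segments.append((start, len(flags) * FRAME_MS / 1000.0))
    return segments

# e.g. speech_segments(flags) might return [(12.3, 26.1), (105.3, 109.2), ...]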