Question

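I have a bunch of audio files, and I would like to automatically add timestamps where speech starts and stops. So there would be a "start" timestamp when an utterance begins and a "stop" timestamp when it ends.

Like this: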

start,stop
0:00:02.40,0:00:11.18
0:00:18.68,0:00:19.77
...

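I tested the following solution and it works: "Split audio files using silence detection". The problem is that I only get chunks out of it, which makes it hard to match the timestamps back to the original audio.

Any solutions or nudges in the right direction would be much appreciated!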

Answer 1

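Ideally, ML algorithms with a comprehensive test/train data set would yield a dynamic solution that might not need any manual tuning of the silence length and threshold values.

However, a simple static solution can be devised using pydub's detect_nonsilent method. This method returns the start and stop times of the non-silent chunks, in sequence.

The following parameters affect the result and may need some tuning.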

min_silence_len: the minimum length of silence to look for in the audio, in milliseconds.
silence_thresh: anything below this threshold (in dBFS) is considered silence.
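
To make the interplay of the two parameters concrete, here is a toy sketch in plain Python. It is not pydub's implementation (pydub works on the RMS of audio windows); it assumes a hypothetical input of one loudness value in dBFS per millisecond, and the function name toy_detect_nonsilent is made up for illustration.

```python
def toy_detect_nonsilent(levels_db, min_silence_len, silence_thresh):
    """Return [start, stop] index ranges of non-silence (toy model).

    levels_db: one loudness value (dBFS) per millisecond (hypothetical input).
    A run of values below silence_thresh only counts as silence if it
    lasts at least min_silence_len milliseconds.
    """
    n = len(levels_db)

    # Step 1: collect silent runs that are long enough.
    silent_runs = []
    run_start = None
    # The +inf sentinel is "loud", so it closes a silent run at the very end.
    for i, level in enumerate(list(levels_db) + [float("inf")]):
        if level < silence_thresh:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_silence_len:
                silent_runs.append((run_start, i))
            run_start = None

    # Step 2: everything between the accepted silent runs is non-silent.
    nonsilent = []
    cursor = 0
    for start, stop in silent_runs:
        if start > cursor:
            nonsilent.append([cursor, start])
        cursor = stop
    if cursor < n:
        nonsilent.append([cursor, n])
    return nonsilent


# 5 ms quiet, 4 ms loud, 2 ms quiet, 3 ms loud, 6 ms quiet
levels = [-50] * 5 + [-10] * 4 + [-50] * 2 + [-10] * 3 + [-50] * 6

print(toy_detect_nonsilent(levels, min_silence_len=3, silence_thresh=-20))
# [[5, 14]]  (the 2 ms pause is too short to split the chunk)
print(toy_detect_nonsilent(levels, min_silence_len=2, silence_thresh=-20))
# [[5, 9], [11, 14]]
```

A larger min_silence_len merges words separated by short pauses into one chunk, while a higher (less negative) silence_thresh treats more of the signal as silence.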

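During my attempts, I did notice that normalizing the audio before running it through the detect_nonsilent method helped a lot, probably because gain is applied to reach an average amplitude level, which makes detecting silence much easier.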
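
The arithmetic behind that normalization can be sketched in plain Python. This is only an illustration under assumptions: the samples are hypothetical 16-bit values, loudness is taken as RMS relative to full scale, and the helper names dbfs and apply_gain are made up here rather than taken from pydub.

```python
import math


def dbfs(samples, full_scale=32768):
    """RMS loudness of the samples relative to full scale, in dB."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / full_scale)


def apply_gain(samples, gain_db):
    """Scale the samples by a gain given in dB (amplitude ratio)."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]


quiet = [1000, -1000, 1000, -1000]  # hypothetical quiet 16-bit samples
target_dbfs = -20.0

# Same formula as match_target_amplitude: gain = target - current level.
gain = target_dbfs - dbfs(quiet)
normalized = apply_gain(quiet, gain)

print(round(dbfs(quiet), 1))       # -30.3
print(round(gain, 1))              # 10.3
print(round(dbfs(normalized), 1))  # -20.0
```

Because silence_thresh is an absolute level, bringing every file to the same average loudness first plausibly lets one threshold value work across recordings made at different volumes.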

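The sample audio files were downloaded from the Open Speech Repository. Each audio file contains 10 spoken sentences with some gap between them.

Here is a working demo: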

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

#adjust target amplitude
def match_target_amplitude(sound, target_dBFS):
    change_in_dBFS = target_dBFS - sound.dBFS
    return sound.apply_gain(change_in_dBFS)

#Convert wav to audio_segment
audio_segment = AudioSegment.from_wav("OSR_us_000_0010_8k.wav")

#normalize audio_segment to -20dBFS 
normalized_sound = match_target_amplitude(audio_segment, -20.0)
print("length of audio_segment={} seconds".format(len(normalized_sound)/1000))

#Print detected non-silent chunks, which in our case would be spoken words.
nonsilent_data = detect_nonsilent(normalized_sound, min_silence_len=500, silence_thresh=-20, seek_step=1)

#convert ms to seconds
print("start,Stop")
for chunks in nonsilent_data:
    print( [chunk/1000 for chunk in chunks])

Result:

root# python nonSilence.py 
length of audio_segment=33.623 seconds
start,Stop
[0.81, 2.429]
[4.456, 5.137]
[8.084, 8.668]
[11.035, 12.334]
[14.387, 15.601]
[17.594, 18.133]
[20.733, 21.289]
[24.007, 24.066]
[27.372, 27.977]
[30.361, 30.996]

As seen in Audacity (comparison shown below), our results are close, with offsets of roughly 0.1-0.4 seconds. Tuning the detect_nonsilent parameters may help.

Count  From Script     From Audacity
1      0.81-2.429      0.573-2.833
2      4.456-5.137     4.283-6.421
3      8.084-8.668     7.824-9.679
4      11.035-12.334   10.994-12.833
5      14.387-15.601   14.367-16.120
6      17.594-18.133   17.3-19.021
7      20.733-21.289   20.471-22.258
8      24.007-24.066   23.843-25.664
9      27.372-27.977   27.081-28.598
10     30.361-30.996   30.015-32.240
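
To get from the millisecond ranges returned by detect_nonsilent to the start,stop layout asked for in the question, a small formatting step is enough. The sketch below is one possible way to do it; the helper names format_ms and to_csv_lines are made up for illustration, and the fraction is truncated (not rounded) to hundredths of a second.

```python
def format_ms(ms):
    """Format milliseconds as H:MM:SS.cc (hundredths, truncated)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours}:{minutes:02d}:{seconds:02d}.{millis // 10:02d}"


def to_csv_lines(ranges_ms):
    """Turn [[start_ms, stop_ms], ...] into 'start,stop' CSV lines."""
    lines = ["start,stop"]
    for start, stop in ranges_ms:
        lines.append(f"{format_ms(start)},{format_ms(stop)}")
    return lines


# First two ranges from the result above, in milliseconds.
nonsilent_data = [[810, 2429], [4456, 5137]]
print("\n".join(to_csv_lines(nonsilent_data)))
# start,stop
# 0:00:00.81,0:00:02.42
# 0:00:04.45,0:00:05.13
```

Since the ranges refer to positions in the original audio, the timestamps line up with the source file directly, with no need to work back from exported chunks.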

Answer 2

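You can do something similar to the pydub solution you posted above, but use the detect_silence function instead (from pydub.silence import detect_silence), which gives you the "silent ranges", i.e. the start and stop of each silent period. The negative image of those ranges, from the stop of one silent range to the start of the next, gives the non-silent periods. Someone shows an example of using detect_silence here.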
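
The "negative image" step can be sketched in plain Python as follows. This is not the code from the linked example; it assumes the silent ranges come as a sorted list of [start, stop] pairs in milliseconds, as detect_silence returns them, and the helper name invert_ranges and the sample values are made up for illustration.

```python
def invert_ranges(silent_ranges, total_len_ms):
    """Return the gaps between silent ranges, i.e. the non-silent ranges.

    silent_ranges: sorted [[start_ms, stop_ms], ...]
    total_len_ms:  length of the whole audio in milliseconds
    """
    nonsilent = []
    cursor = 0  # end of the last silent range seen so far
    for start, stop in silent_ranges:
        if start > cursor:
            nonsilent.append([cursor, start])
        cursor = max(cursor, stop)
    # Audio after the last silent range is non-silent too.
    if cursor < total_len_ms:
        nonsilent.append([cursor, total_len_ms])
    return nonsilent


# With pydub, the input would come from something like:
#   silent_ranges = detect_silence(audio_segment, min_silence_len=500, silence_thresh=-20)
#   total_len_ms = len(audio_segment)
silent_ranges = [[0, 810], [2429, 4456], [5137, 8084]]  # hypothetical values
print(invert_ranges(silent_ranges, total_len_ms=9000))
# [[810, 2429], [4456, 5137], [8084, 9000]]
```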

Edit:
Here is the example code from the link (in case the link breaks):


Get timestamps from audio using Python
