Extracting plain text from YouTube WebVTT captions

Time: 2018-05-17 22:27:50

Tags: youtube youtube-dl closed-captions webvtt

Running youtube-dl --write-auto-sub produces a file like this:

WEBVTT
Kind: captions
Language: en
Style:
::cue(c.colorCCCCCC) { color: rgb(204,204,204);
 }
::cue(c.colorE5E5E5) { color: rgb(229,229,229);
 }
##

00:00:00.030 --> 00:00:02.619 align:start position:0%

<c.colorE5E5E5>because<00:00:00.630><c> then</c><00:00:00.780><c> media</c><00:00:01.079><c> tries</c><00:00:01.380><c> to</c><00:00:01.589><c> sell</c><00:00:01.800><c> chips</c></c><c.colorCCCCCC><00:00:02.129><c> a</c></c>

00:00:02.619 --> 00:00:02.629 align:start position:0%
<c.colorE5E5E5>because then media tries to sell chips</c><c.colorCCCCCC> a
 </c>

00:00:02.629 --> 00:00:05.869 align:start position:0%
<c.colorE5E5E5>because then media tries to sell chips</c><c.colorCCCCCC> a
lot<00:00:03.629><c> of</c><00:00:03.870><c> chips</c></c><c.colorE5E5E5><00:00:04.200><c> into</c></c><c.colorCCCCCC><00:00:04.560><c> the</c></c><c.colorE5E5E5><00:00:04.890><c> Android</c><00:00:05.279><c> Market</c><00:00:05.700><c> and</c></c>

00:00:05.869 --> 00:00:05.879 align:start position:0%
lot of chips<c.colorE5E5E5> into</c><c.colorCCCCCC> the</c><c.colorE5E5E5> Android Market and
 </c>

00:00:05.879 --> 00:00:08.900 align:start position:0%
lot of chips<c.colorE5E5E5> into</c><c.colorCCCCCC> the</c><c.colorE5E5E5> Android Market and
NVIDIA</c><c.colorCCCCCC><00:00:06.600><c> has</c></c><c.colorE5E5E5><00:00:06.839><c> been</c><00:00:07.109><c> the</c><00:00:07.350><c> single</c><00:00:07.980><c> worst</c><00:00:08.280><c> company</c></c>

00:00:08.900 --> 00:00:08.910 align:start position:0%
NVIDIA<c.colorCCCCCC> has</c><c.colorE5E5E5> been the single worst company
 </c>

00:00:08.910 --> 00:00:14.420 align:start position:0%
NVIDIA<c.colorCCCCCC> has</c><c.colorE5E5E5> been the single worst company
we've<00:00:09.090><c> ever</c><00:00:09.389><c> dealt</c><00:00:09.719><c> with</c><00:00:09.870><c> so</c><00:00:10.620><c> Nvidia</c><00:00:11.090><c> fuck</c><00:00:12.090><c> you</c></c>

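webvtt-py can be used to extract the color and timing information, but why does YouTube generate duplicated captions? What is the best way to get the plain-text captions? I tried ignoring every cue that is only 0.010 seconds long, but the remaining cues still overlap (that is, the text at the end of one cue repeats at the start of the next).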
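One possible approach, sketched here with only the Python standard library rather than webvtt-py, is to ignore cue boundaries altogether: strip the inline tags from every text line and keep a line only when it differs from the last line kept. In the sample above, each cue repeats the previous cue's final line as its own first line, so this heuristic removes the overlap. The function name vtt_to_plain_lines is hypothetical, and the sketch assumes the file has no cue identifiers or NOTE blocks, as in the sample. A speaker who really says the same line twice in a row would also be collapsed.

```python
import html
import re

# Matches inline WebVTT tags such as <c>, </c>, <c.colorE5E5E5> and <00:00:00.630>.
TAG_RE = re.compile(r"<[^>]*>")


def vtt_to_plain_lines(vtt_text):
    """Return caption lines as plain text, dropping consecutive repeats.

    Assumption: everything before the first timing line is header
    (WEBVTT, Kind, Language, Style) and can be skipped.
    """
    lines = []
    seen_first_cue = False
    for raw in vtt_text.splitlines():
        if "-->" in raw:
            # Timing line: marks the start of a cue and carries no text.
            seen_first_cue = True
            continue
        if not seen_first_cue:
            continue
        # Remove tags first, then decode entities such as &amp;.
        text = html.unescape(TAG_RE.sub("", raw))
        text = " ".join(text.split())  # normalize whitespace
        if not text:
            continue
        if lines and lines[-1] == text:
            # Same as the last kept line: the rolling-caption repeat.
            continue
        lines.append(text)
    return lines
```

To use it, read the downloaded .vtt file as UTF-8 text, pass the contents to vtt_to_plain_lines, and join the result with newlines or spaces.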

0 answers:

No answers yet