I am using youtube-dl to download WebVTT files from YouTube.
A typical file looks like this:
WEBVTT
Kind: captions
Language: en
00:00:00.730 --> 00:00:05.200 align:start position:0%
[Applause]
00:00:05.200 --> 00:00:05.210 align:start position:0%
[Applause]
00:00:05.210 --> 00:00:11.860 align:start position:0%
[Applause]
hi<00:00:06.440><c> I'm</c><00:00:07.440><c> here</c><00:00:07.740><c> to</c><00:00:08.160><c> talk</c><00:00:08.429><c> to</c><00:00:09.019><c> share</c><00:00:10.019><c> an</c><00:00:10.469><c> idea</c><00:00:10.820><c> to</c>
00:00:11.860 --> 00:00:11.870 align:start position:0%
hi I'm here to talk to share an idea to
00:00:11.870 --> 00:00:15.890 align:start position:0%
hi I'm here to talk to share an idea to
communicate<00:00:12.920><c> but</c><00:00:13.920><c> what</c><00:00:14.790><c> is</c><00:00:14.940><c> communication</c>
00:00:15.890 --> 00:00:15.900 align:start position:0%
communicate but what is communication
I want a plain text file like this:
hi I'm here to talk to share an idea to
communicate but what is communication
Using code I found online, I got this far:
cat output.vtt | sed "s/^[0-9]*[0-9\:\.\ \>\-]*//g" | grep -v "^WEBVTT\|^Kind: cap\|^Language" | awk 'BEGIN{ RS="\n\n+"; RS="\n\n" }NR>=2{ print }' > dialogues.txt
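But this is far from perfect: I get a lot of useless whitespace, and every sentence appears twice. Could you help me? A similar question has been asked before, but the answers posted there did not work for me.
Thanks!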
Answer 0 (score: 1)
You may be able to do something like this:
sed -e '1,4d' -E -e '/^$|]|>$|%$/d' output.vtt | awk '!seen[$0]++' > dialogues.txt
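The first sed expression deletes the first 4 lines. The second deletes every blank line, every line that contains ], and every line that ends in > or %. awk then removes duplicate lines. Result: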
hi I'm here to talk to share an idea to
communicate but what is communication
You may need to tweak it a little, but it should get you closer to the result you want.
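To see the pipeline above at work, here is a minimal sketch that runs it on a shortened sample. The helper name `vtt_to_text` is hypothetical, and the sample assumes the usual WebVTT layout: three header lines, a blank line, then cues separated by blank lines. If your files have a different header length, the `1,4d` part will need adjusting.

```shell
#!/bin/sh
# Hypothetical helper (not part of the answer): wraps the sed + awk pipeline.
# Reads the file named in $1, or stdin when no argument is given.
vtt_to_text() {
  sed -e '1,4d' -E -e '/^$|]|>$|%$/d' "$@" | awk '!seen[$0]++'
}

# Assumed sample: 3 header lines, one blank line, then cues separated by blank lines.
sample=$(mktemp)
cat > "$sample" <<'EOF'
WEBVTT
Kind: captions
Language: en

00:00:00.730 --> 00:00:05.200 align:start position:0%
[Applause]

00:00:05.210 --> 00:00:11.860 align:start position:0%
[Applause]
hi<00:00:06.440><c> I'm</c><00:00:07.440><c> here</c>

00:00:11.860 --> 00:00:11.870 align:start position:0%
hi I'm here

00:00:11.870 --> 00:00:15.890 align:start position:0%
hi I'm here
communicate<00:00:12.920><c> but</c>

00:00:15.890 --> 00:00:15.900 align:start position:0%
communicate but
EOF

# Prints:
#   hi I'm here
#   communicate but
vtt_to_text "$sample"
rm -f "$sample"
```

The tagged lines are dropped rather than cleaned, which works here only because each one is repeated in plain form by a later cue.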
Answer 1 (score: 1)
Could you try the following, which does everything in a single awk?
awk 'FNR<=4 || ($0 ~ /^$|-->|\[|\]|</){next} !a[$0]++' Input_file
Explanation: the same code, annotated.
awk ' ##Start of the awk program.
FNR<=4 || ($0 ~ /^$|-->|\[|\]|</){ ##If the line number is 4 or less, OR the line is empty or contains -->, [, ] or <, skip it.
next ##next skips all remaining statements for this line.
}
!a[$0]++ ##a[$0] counts how often this exact line has been seen; the test is true only the first time, so each distinct line is printed once.
' Input_file ##Name of the input file.
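The awk approach above relies on the header being exactly four lines and on blank lines being truly empty. As a hedged variant (not the answerer's code), the sketch below matches the header lines by content and also skips whitespace-only lines, which may be the source of the stray spaces mentioned in the question. The function name `vtt_lines` is hypothetical, and the header patterns assume the three header fields shown in the sample file.

```shell
#!/bin/sh
# Hypothetical variant: filters a WebVTT file given as $1, or stdin if no argument.
vtt_lines() {
  awk '
    /^WEBVTT/ || /^Kind:/ || /^Language:/ { next }  # header lines, matched by content
    /^[[:space:]]*$/                      { next }  # empty or whitespace-only lines
    /-->/ || /\[/ || /\]/ || /</          { next }  # cue timings, [Applause]-style notes, tagged lines
    !seen[$0]++                                     # print each remaining line only once
  ' "$@"
}

# Example: vtt_lines output.vtt > dialogues.txt
```

Like the original, this deduplicates across the whole file, so a sentence that is genuinely spoken twice would be printed only once.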