Question

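I have a subtitle file, and I want to unbreak all the subtitles, that is, join the text lines of each subtitle into a single line. An example: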

1
00:02:08,315 --> 00:02:10,786
Hello Jim.
How are you?

2
00:02:10,869 --> 00:02:13,192
I'm well.
And you?

I want to convert it to:

1
00:02:08,315 --> 00:02:10,786
Hello Jim. How are you?

2
00:02:10,869 --> 00:02:13,192
I'm well. And you?

The subtitle numbers and timecodes should not be unbroken. How can I do this with sed?

Answer 1

You can do this with awk, treating each blank-line-separated block as one record:

awk 'BEGIN { RS = ""; FS = "\n" }
     NR > 1 { print "" }
     { print $1; print $2;
       for (i = 3; i < NF; ++i) printf "%s ", $i;
       print $NF;
     }' your_file.txt

Output:

1
00:02:08,315 --> 00:02:10,786
Hello Jim. How are you?

2
00:02:10,869 --> 00:02:13,192
I'm well. And you?
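
If awk is not available, the same paragraph-mode idea can be sketched in Python. This is only an illustration and not part of the answer above: the function name `unbreak_blocks` is made up, and the sketch assumes that blocks are separated by blank lines and that the first two lines of each block are the number and the timecode.

```python
import re

def unbreak_blocks(srt_text):
    """Join the text lines of each blank-line-separated subtitle block."""
    # Normalize Windows line endings so "\r" does not end up inside joined text.
    srt_text = srt_text.replace("\r\n", "\n")
    # Split on runs of blank lines, similar to awk's RS = "".
    blocks = [b for b in re.split(r"\n{2,}", srt_text.strip("\n")) if b]
    out = []
    for block in blocks:
        lines = block.split("\n")
        # First two lines: subtitle number and timecode, kept as they are.
        head, text = lines[:2], lines[2:]
        if text:
            head.append(" ".join(text))
        out.append("\n".join(head))
    return "\n\n".join(out) + "\n" if out else ""
```

Unlike the awk one-liner, this sketch does not print the timecode line twice when a block has no text lines at all.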

Answer 2

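This small awk script will do the job. It is a bit more complicated than it needs to be, but it could serve as a basis for more advanced processing. Maybe...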

awk 'BEGIN                     { state = "copy" }
     (state == "copy")         { print }
     /-->/                     { state = "text"; next }
     /.+/ && (state == "text") { printf("%s ",$0); next }
     /^$/                      { printf("\n\n"); state = "copy"; next }
     END                       { printf("\n") }
    ' < sub.txt

Given your input file, this produces:

1
00:02:08,315 --> 00:02:10,786
Hello Jim. How are you? 

2
00:02:10,869 --> 00:02:13,192
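I'm well. And you?

Edit: After looking at the sample file you posted in a comment on another answer, I can only guess that you want to merge consecutive ... lines. So this simple Perl trick should be enough: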

sh$ unzip 56939b22f5174a770a79f6b0b0cf7caaee1c9dfb.zip
Archive:  56939b22f5174a770a79f6b0b0cf7caaee1c9dfb.zip
  inflating: Red.Planet.2000.1080p.REPACK.BluRay.x264-7SinS.srt
sh$ perl -0pe 's|\r\n| |m' < Red.Planet.2000.1080p.REPACK.BluRay.x264-7SinS.srt
1
00:00:35,661 --> 00:00:40,792
By the year 2000, we had begun to overpopulate, pollute and poison our planet...

2
00:00:41,208 --> 00:00:43,176
...faster than we could clean it up.

3
00:00:43,377 --> 00:00:48,053
We ignored the problem for as long as we could but we were kidding ourselves.
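
The state machine in the awk script at the top of this answer carries over to other languages almost line for line. Here is a rough Python sketch under the same assumptions (a line containing `-->` starts the text, a blank line ends it). `unbreak_lines` is a hypothetical name, and unlike the awk version it joins the collected lines at the end of each block, so there is no trailing space.

```python
def unbreak_lines(lines):
    """Yield output lines, joining the text lines that follow a timecode line."""
    text = []        # text lines collected for the current subtitle
    in_text = False  # True once the timecode line ("-->") has been seen
    for raw in lines:
        line = raw.rstrip("\r\n")
        if in_text:
            if line:
                text.append(line)
            else:
                # A blank line ends the subtitle: emit the joined text,
                # then the blank separator line.
                yield " ".join(text)
                yield ""
                text, in_text = [], False
        else:
            # "copy" state: numbers, timecodes and stray lines pass through.
            yield line
            if "-->" in line:
                in_text = True
    if in_text and text:
        # The input ended without a trailing blank line.
        yield " ".join(text)
```

It can be fed any iterable of lines, for example an open file object, and the result written back with `"\n".join(...)`.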

Answer 3

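If all subtitle blocks are separated by blank lines, and you want to keep the first two lines of each block and join the rest with spaces, you can use Perl: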

perl -F'\n' -aln00e 'print "$F[0]\n$F[1]\n", (join" ",@F[2..$#F]), "\n"' myfile.txt

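However, this breaks if there are blank lines inside the spoken text. I guess you won't mind deleting blank lines that sit inside the spoken text, though. If so, just add a preprocessing step: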

perl -lp0777e 's/\n\n+(?!\d+\n\d\d:\d\d:\d\d,\d\d\d\s*-->)/\n/g' myfile.txt
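
To make it clearer what that substitution does, here is roughly the same regular expression in Python. Treat it as a sketch: `drop_inner_blank_lines` is a made-up helper, and it assumes the text uses Unix line endings.

```python
import re

# A run of blank lines that is NOT followed by "<number>\n<hh:mm:ss,mmm> -->",
# i.e. one that sits inside the spoken text rather than between two subtitles.
INNER_BLANKS = re.compile(r"\n\n+(?!\d+\n\d\d:\d\d:\d\d,\d\d\d\s*-->)")

def drop_inner_blank_lines(srt_text):
    """Collapse blank lines inside subtitle text into a single line break."""
    return INNER_BLANKS.sub("\n", srt_text)
```

The negative lookahead is what protects the real block separators: a blank line is only removed when the next line does not look like the start of a new subtitle.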

Answer 4

A solution in the TXR language:

@(repeat)
@num
@fromtime --> @totime
@(collect)
@line
@(until)

@(end)
@(output)
@num
@fromtime --> @totime
@(rep)@line @(last)@line@(end)
@(end)
@(end)

Run:

$ txr unbreak.txr sub.srt 
1
00:02:08,315 --> 00:02:10,786
Hello Jim. How are you?
2
00:02:10,869 --> 00:02:13,192
I'm well. And you?

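Even though we are precisely extracting more features of the SRT file than are needed to get the job done, the desired output is easy to achieve. The code can also easily be bent toward more complicated transformations.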

Answer 5

This command line also works:

cat red.srt | tr '\012' '\040' | sed 's/[0-9]\+ ..:..:..,... --> ..:..:..,.../\n\0\n/g' | sed 's/^[0-9]\+ /\n\0\n/g' | sed 's/^ *//g; s/ \+/ /g; s/ *$//g' | sed '1,2d' > final.srt

I know this solution is not elegant, but it works very well for me.
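
The idea behind the pipeline (flatten everything onto one line, then put the line breaks back around each number and timecode) can also be written as a short Python sketch. The function name `flatten_then_split` is hypothetical, and the sketch assumes timecodes always have the exact `hh:mm:ss,mmm --> hh:mm:ss,mmm` shape.

```python
import re

TIMECODE = r"\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d"
# A subtitle number and its timecode, as they appear once the file is flattened.
HEADER = re.compile(r"(?:^| )(\d+) (" + TIMECODE + r") ?")

def flatten_then_split(srt_text):
    """Flatten the whole file, then restore breaks around number + timecode."""
    # Step 1: collapse every run of whitespace, newlines included, to one space.
    flat = " ".join(srt_text.split())
    # Step 2: put each number and timecode back on their own lines,
    # preceded by a blank line that separates the blocks.
    restored = HEADER.sub(r"\n\n\1\n\2\n", flat)
    # Step 3: drop the blank lines that were added before the first block.
    return restored.lstrip("\n") + "\n"
```

As with the sed version, a spoken line that happens to contain a number followed by something shaped like a timecode would be split by mistake.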

Unbreak specific lines of text

5 answers: