Join lines in Python if the next line does not match a pattern

Time: 2015-12-07 14:19:24

Tags: python regex negative-lookahead

Hi, I have a subtitle file that looks like this:

00:00:29:02 00:00:35:00 text 1
text 2
00:00:36:04 00:00:44:08 text 3
text 4
00:00:44:12 00:00:48:00 text 5
00:00:49:17 00:00:52:17 text 6

In Python, what should I put in place of "HELP PLEASE" in

newdata = re.sub("""HELP PLEASE""", r"\1", filedata)

to produce lines like this:

00:00:29:02 00:00:35:00 text 1 text 2
00:00:36:04 00:00:44:08 text 3 text 4
00:00:44:12 00:00:48:00 text 5
00:00:49:17 00:00:52:17 text 6

Thanks

1 answer:

Answer 0: (score: 1)

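If the file is not too large, you can read each line into a new list. If a line does not start with a timestamp, pop the last line that was added to new_lines, append the current line to it, and push the result back onto new_lines.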

>>> import re
>>>
>>> # Assume all_lines = somefile.readlines() (with trailing newlines stripped),
... # or iterate over the file in the for loop below; simplified to a list here.
... all_lines = [
... "00:00:29:02 00:00:35:00 text 1",
... "text 2",
... "00:00:36:04 00:00:44:08 text 3",
... "text 4",
... "00:00:44:12 00:00:48:00 text 5",
... "00:00:49:17 00:00:52:17 text 6",
... "text 7",  # added for interest
... "text 8",  # added for interest
... ]
>>>
>>> new_lines = []
>>> for line in all_lines:
...     if not re.match(r'(?:(?:\d\d:){3}(?:\d\d) ){2}.*', line):
...         # line did not start with a timestamp
...         new_lines.append(new_lines.pop() + ' ' + line)
...     else:
...         new_lines.append(line)
...
>>> print('\n'.join(new_lines))
00:00:29:02 00:00:35:00 text 1 text 2
00:00:36:04 00:00:44:08 text 3 text 4
00:00:44:12 00:00:48:00 text 5
00:00:49:17 00:00:52:17 text 6 text 7 text 8
>>>

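It would not be hard to use a prev_line variable instead, which you keep dumping/yielding as you go rather than building a potentially huge new_lines list.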

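By the way, this fails if the first line is not a timestamp: new_lines is still empty at that point, so new_lines.pop() raises an IndexError.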
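One way to avoid that failure is to check whether new_lines is still empty before joining. This is only a minimal sketch: the function name join_continuations and the compiled TIMESTAMP pattern are names made up for illustration, and it assumes that a leading line without a timestamp should simply be kept as its own line.

```python
import re

# Two "HH:MM:SS:FF " timestamps at the start of a line (same idea as the
# pattern in the answer; TIMESTAMP is a name chosen for this sketch).
TIMESTAMP = re.compile(r'(?:(?:\d\d:){3}\d\d ){2}')


def join_continuations(lines):
    """Return a new list where lines without a timestamp are appended
    to the preceding line, separated by a single space."""
    new_lines = []
    for line in lines:
        if new_lines and not TIMESTAMP.match(line):
            # Continuation line: glue it onto the last kept line.
            new_lines[-1] += ' ' + line
        else:
            # Timestamped line, or the very first line of the input.
            new_lines.append(line)
    return new_lines


sample = [
    "text 0",  # hypothetical leading line without a timestamp
    "00:00:29:02 00:00:35:00 text 1",
    "text 2",
]
print('\n'.join(join_continuations(sample)))
```

Modifying new_lines[-1] in place also avoids the pop/append round trip.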

PS: Not sure why everyone is so keen on regular expressions.
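For comparison, the single re.sub call the question asks about could be sketched with a negative lookahead, as the question's tags suggest. This is an assumption about what the asker wanted rather than a tested drop-in: it replaces the newline with a space instead of using the r"\1" replacement from the question, it assumes "\n" line endings, and JOIN_RE is a name invented here.

```python
import re

# Sample input in the question's format (assumes "\n" line endings).
filedata = (
    "00:00:29:02 00:00:35:00 text 1\n"
    "text 2\n"
    "00:00:36:04 00:00:44:08 text 3\n"
    "text 4\n"
    "00:00:44:12 00:00:48:00 text 5\n"
    "00:00:49:17 00:00:52:17 text 6\n"
)

# Match a newline that is NOT followed by two timestamps. The \Z alternative
# keeps the final newline of the data from being turned into a space.
JOIN_RE = re.compile(r'\n(?!\Z|(?:(?:\d\d:){3}\d\d ){2})')

newdata = JOIN_RE.sub(' ', filedata)
print(newdata)
```

This needs the whole file in memory as one string, so the line-by-line loops are still the safer choice for large files.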

Edit: without creating a potentially huge new_lines list...

>>> prev_line = ''
>>> for line in all_lines:
...     if not re.match(r'(?:(?:\d\d:){3}(?:\d\d) ){2}.*', line):
...         prev_line += ' ' + line
...     else:
...         if prev_line:  # skip the initial '' placeholder so it is not printed
...             print(prev_line)
...         prev_line = line
...
00:00:29:02 00:00:35:00 text 1 text 2
00:00:36:04 00:00:44:08 text 3 text 4
00:00:44:12 00:00:48:00 text 5
>>> print(prev_line)  # make sure to print/dump the last one
00:00:49:17 00:00:52:17 text 6 text 7 text 8
>>>
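The same loop can be wrapped in a generator so that the caller decides what to do with each joined line. This is a rough sketch, not the only way to do it: merged_lines and TIMESTAMP are names chosen here, trailing newlines are stripped on the assumption that the input comes straight from a file object, and None is used as the "nothing seen yet" marker instead of the empty string.

```python
import re

# Two "HH:MM:SS:FF " timestamps at the start of a line.
TIMESTAMP = re.compile(r'(?:(?:\d\d:){3}\d\d ){2}')


def merged_lines(lines):
    """Yield each timestamped line with its continuation lines appended."""
    prev_line = None  # None means no line has been read yet
    for line in lines:
        line = line.rstrip('\n')
        if prev_line is not None and not TIMESTAMP.match(line):
            # Continuation: keep accumulating onto the pending line.
            prev_line += ' ' + line
        else:
            # A new timestamped line starts: emit the pending one first.
            if prev_line is not None:
                yield prev_line
            prev_line = line
    if prev_line is not None:
        yield prev_line  # do not forget the last pending line


all_lines = [
    "00:00:29:02 00:00:35:00 text 1",
    "text 2",
    "00:00:36:04 00:00:44:08 text 3",
    "text 4",
    "00:00:44:12 00:00:48:00 text 5",
    "00:00:49:17 00:00:52:17 text 6",
    "text 7",
    "text 8",
]
for joined in merged_lines(all_lines):
    print(joined)
```

Because it accepts any iterable of strings, an open file object can be passed in directly, so only one joined line is held in memory at a time.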

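Two caveats: (1) If a line is actually blank, it will be skipped. (2) Although the second version with prev_line is memory-efficient even when the file is large, it will still use a lot of memory if you have many consecutive lines without a timestamp (like text 7 and text 8), because prev_line has to hold everything until a timestamped line comes along. You can work around that by dumping to a file without an explicit newline (\n) and adding a newline before dumping a line that starts with a timestamp.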
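That workaround might look roughly like the following. It is a sketch under a few assumptions: stream_join is a made-up name, src and dst are any text file-like objects (io.StringIO stands in for real files here), and blank lines are dropped on purpose.

```python
import io
import re

# Two "HH:MM:SS:FF " timestamps at the start of a line.
TIMESTAMP = re.compile(r'(?:(?:\d\d:){3}\d\d ){2}')


def stream_join(src, dst):
    """Copy src to dst, writing continuation lines onto the current output
    line. Nothing is buffered beyond the line being read."""
    wrote_anything = False
    for line in src:
        line = line.rstrip('\n')
        if not line:
            continue  # blank lines are skipped in this sketch
        if not wrote_anything:
            dst.write(line)         # first line: no separator needed
            wrote_anything = True
        elif TIMESTAMP.match(line):
            dst.write('\n' + line)  # timestamp: start a new output line
        else:
            dst.write(' ' + line)   # continuation: stay on the same line
    if wrote_anything:
        dst.write('\n')             # terminate the last output line


src = io.StringIO(
    "00:00:44:12 00:00:48:00 text 5\n"
    "00:00:49:17 00:00:52:17 text 6\n"
    "text 7\n"
    "text 8\n"
)
dst = io.StringIO()
stream_join(src, dst)
print(dst.getvalue())
```

With real files, src and dst would be the objects returned by open(), one opened for reading and one for writing.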