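Combining lines in a txt file with Python

Time: 2015-05-15 04:44:39

Tags: python

A question about combining lines in a txt file.

The file content is shown below (movie subtitles). Within each block, I want to combine the subtitle text (the English words and sentences) into 1 line, instead of the 1, 2 or 3 separate lines it occupies now.

Could you tell me which approach would work for this in Python? Many thanks.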

1
00:00:23,343 --> 00:00:25,678
Been a while since I was up here
in front of you.

2
00:00:25,762 --> 00:00:28,847
Maybe I'll do us all a favour
and just stick to the cards.

3
00:00:31,935 --> 00:00:34,603
There's been speculation that I was
involved in the events that occurred
on the freeway and the rooftop...

4
00:00:36,189 --> 00:00:39,233
Sorry, Mr Stark, do you
honestly expect us to believe that

5
00:00:39,317 --> 00:00:42,903
that was a bodyguard
in a suit that conveniently appeared,

6
00:00:42,987 --> 00:00:45,698
despite the fact
that you sorely despise bodyguards?

7
00:00:45,782 --> 00:00:46,907
Yes.

8
00:00:46,991 --> 00:00:51,662
And this mysterious bodyguard
was somehow equipped

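4 answers:

Answer 0 (score: 2)

An intuitive solution

A simple solution based on the 4 types of line you can have:

  • an empty line
  • a number indicating the position (no letters)
  • the timing of the subtitle (a specific pattern; no letters)
  • text

You can loop over the lines, classify each one, and act accordingly.

In fact, the "action" for the non-text, non-empty lines (timing and number) is the same. So: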

import re

with open('yourfile.txt') as f:
    exampleText = f.read()

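new = ''

for line in exampleText.split('\n'):
    if line == '':
        # blank line: end the joined text line and keep an empty separator line
        new += '\n\n'
    elif re.search('[a-zA-Z]', line):  # check if there is text
        new += line + ' '  # stay on the same output line
    else:
        # number or timing line: keep it on a line of its own
        new += line + '\n'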

Result:

>>> print(new)
1
00:00:23,343 --> 00:00:25,678
Been a while since I was up here in front of you. 

2
00:00:25,762 --> 00:00:28,847
Maybe I'll do us all a favour and just stick to the cards. 
...

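The regular expression explained:

  • [] matches any single character listed inside the brackets
  • a-z is the character range a to z
  • A-Z is the character range A to Z

So re.search('[a-zA-Z]', line) succeeds whenever the line contains at least one ASCII letter.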
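
To see how that character class sorts the four line types, here is a minimal sketch. The helper name classify and the sample list are illustrative assumptions, not part of the answer's code.

```python
import re

def classify(line):
    """Return 'blank', 'text', or 'other' (number or timing line)."""
    if line == '':
        return 'blank'
    if re.search('[a-zA-Z]', line):  # at least one ASCII letter
        return 'text'
    return 'other'

samples = [
    '',
    '1',
    '00:00:23,343 --> 00:00:25,678',
    'Been a while since I was up here',
]
kinds = [classify(s) for s in samples]
print(kinds)  # ['blank', 'other', 'other', 'text']
```

Note that a subtitle line with no ASCII letters at all (for example a lone "...") would be classified as 'other' and kept on a line of its own.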

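Answer 1 (score: 1)

The pattern seems to be:

  1. a line containing only a number,
  2. a following line with the timing information,
  3. one or more lines of text, terminated by a blank line.

I would write a loop that reads lines 1 and 2, followed by a nested loop that reads the lines of item 3 until it finds a blank line. That nested loop can join those lines into a single line.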
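
The loop described above could be sketched roughly as follows. This is only one possible reading of the idea: the function name join_subtitle_text and the sample string are assumptions made for illustration, and the input is assumed to follow the number / timing / text / blank-line pattern strictly.

```python
def join_subtitle_text(lines):
    """Join the text lines of each subtitle block into a single line."""
    out = []
    it = iter(lines)
    for number in it:
        if number.strip() == '':
            continue                    # skip stray blank lines between blocks
        out.append(number)              # line 1: the number
        timing = next(it, None)         # line 2: the timing information
        if timing is None:
            break                       # file ended right after a number
        out.append(timing)
        text = []
        for line in it:                 # nested loop: read text until a blank line
            if line.strip() == '':
                break
            text.append(line.strip())
        out.append(' '.join(text))      # the text, combined into one line
        out.append('')                  # blank separator after the block
    return out

sample = """1
00:00:23,343 --> 00:00:25,678
Been a while since I was up here
in front of you.

2
00:00:25,762 --> 00:00:28,847
Maybe I'll do us all a favour
and just stick to the cards.
"""
result = join_subtitle_text(sample.splitlines())
print('\n'.join(result))
```

Both loops share one iterator, so the outer loop resumes at the line right after the blank line that ended the inner loop.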

Answer 2 (score: 1)

Still working on the first line... is this what you expected?


Answer 3 (score: 1)

Loading the prerequisites:

import re

with open('yourfile.txt') as f:
    exampleText = f.read()

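A concise one-liner

re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', re.sub(r'([^0-9])\n', r'\g<1> ', exampleText))

The first substitution handles every line that does not end in a digit, replacing the newline at its end with a space:

tmp = re.sub(r'([^0-9])\n', r'\g<1> ', exampleText)

That substitution also consumes the newline at the end of each block's last text line, so the blank line separating the blocks is lost. The second substitution then puts a newline back in front of the number-only lines:

re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', tmp)

The patterns are written as raw strings so that \g<1> reaches the re module unchanged; a text line that ends in a digit is not joined by this approach.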
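
As a worked example, the two substitutions can be applied to a short inline sample to show the intermediate and final strings. The sample text is taken from the question; the variable names sample and result are assumptions, and the patterns are written as raw strings here.

```python
import re

sample = (
    "7\n"
    "00:00:45,782 --> 00:00:46,907\n"
    "Yes.\n"
    "\n"
    "8\n"
    "00:00:46,991 --> 00:00:51,662\n"
    "And this mysterious bodyguard\n"
    "was somehow equipped\n"
)

# Step 1: a newline preceded by a non-digit character becomes a space.
tmp = re.sub(r'([^0-9])\n', r'\g<1> ', sample)

# Step 2: restore the blank line in front of each number-only line.
result = re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', tmp)

print(repr(tmp))
print(result)
```

After step 1 the blank line between the blocks has collapsed ("Yes. " is followed directly by the line "8"); step 2 restores it. Each joined text line keeps a trailing space.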