How do I read a text file in which some entries contain newlines?

Time: 2017-05-07 18:47:29

Tags: python

I have a text file of this form:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

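As you can see, entries are separated by newlines, but some entries also contain newlines in their content. Simply splitting on line breaks therefore does not parse each entry correctly.

As an example, for the 5th entry I want my output to be:     07/01/2016, 6:14 pm - abcde fghe

This is my current code: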

with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        data.append(row)

1 answer:

Answer 0 (score: 1)

Based on your sample input, you can use a regex with a positive lookahead:

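import re
from pprint import pprint

fn = 'file.txt'  # path to the input file (the name used in the question)
pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

with open(fn) as f:
    pprint([m.group(1) for m in pat.finditer(f.read())])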

Prints:

['06/01/2016, 10:40 pm - abcde\n',
 '07/01/2016, 12:04 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 6:14 pm - abcde\n\nfghe\n',
 '07/01/2016, 6:20 pm - abcde\n',
 '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
 '07/01/2016, 7:58 pm - abcde\n']

With the Dropbox example, it prints:

['11/11/2015, 3:16 pm - IK: 12\n',
 '13/11/2015, 12:10 pm - IK: Hi.\n\nBut this is not about me.\n\nA donation, however small, will go a long way.\n\nThank you.\n',
 '13/11/2015, 12:11 pm - IK: Boo\n',
 '15/11/2015, 8:36 pm - IR: Root\n',
 '15/11/2015, 8:36 pm - IR: LaTeX?\n',
 '15/11/2015, 8:43 pm - IK: Ws\n']

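If you want to remove the \n characters from the captured text, replace m.group(1) in the list comprehension above with m.group(1).strip().replace('\n', ''). Note that this joins the pieces with no separator, so abcde and fghe become abcdefghe.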
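To get the space-separated form asked for in the question (abcde fghe), one option is to collapse every run of whitespace into a single space. This is a minimal sketch, not part of the original answer: the function name flatten_entries and the inline sample string are made up for illustration, and the pattern is the one from the answer.

```python
import re

# Same pattern as in the answer above.
pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

def flatten_entries(text):
    """Return one single-line string per dated entry in text (hypothetical helper)."""
    # str.split() with no arguments splits on any run of whitespace,
    # including '\n\n', so joining with ' ' leaves single spaces.
    return [' '.join(m.group(1).split()) for m in pat.finditer(text)]

# Hypothetical sample, shortened from the question's file.
sample = "07/01/2016, 6:14 pm - abcde\n\nfghe\n07/01/2016, 6:20 pm - abcde\n"
print(flatten_entries(sample))
```

Passing f.read() instead of sample should give the same kind of result for the real file.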

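Explanation of the regex:

^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)

^                                                        start of a line (re.M)
  ^^^^^^^^^^^^^^^^^^^^                                   pattern for a date
                      ^^^                                capture the rest (re.S lets . match newlines)...
                          ^^^                            until (lookahead)
                             ^^^^^^^^^^^^^^^^^^^^^^      another date at the start of a line
                                                   ^     or
                                                    ^^   the end of the string

The doubled ^^ inside the lookahead is redundant: ^ is a zero-width anchor, so a single ^ matches at exactly the same positions.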
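The regex approach reads the whole file into memory with f.read(). As an alternative that is not from the original answer, the same grouping can be sketched line by line: start a new entry whenever a line begins with a date, and fold every other non-blank line into the current entry. The names DATE_START and iter_entries are made up for illustration, and the sketch assumes every entry begins with a dd/mm/yyyy date at the very start of a line.

```python
import re

# Assumption: a new entry always starts with dd/mm/yyyy at the start of a line.
DATE_START = re.compile(r'\d\d/\d\d/\d\d\d\d')

def iter_entries(lines):
    """Yield one string per entry, folding continuation lines into it."""
    current = None
    for raw in lines:
        text = raw.strip()
        if DATE_START.match(raw):
            # A dated line closes the previous entry and opens a new one.
            if current is not None:
                yield current
            current = text
        elif text and current is not None:
            # Non-blank continuation line: append it, separated by a space.
            current += ' ' + text
    if current is not None:
        yield current

# Hypothetical sample, taken from the tail of the question's file.
sample_lines = [
    "07/01/2016, 7:58 pm - abcde\n", "\n", "fghe\n", "\n", "ijkl\n",
    "07/01/2016, 7:58 pm - abcde\n",
]
print(list(iter_entries(sample_lines)))
```

Because an open file object iterates over its lines, list(iter_entries(text_file)) could stand in for the loop in the question. Lines that appear before the first dated line are dropped, which mirrors the regex, since it only starts matching at a date.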