I'm trying to extract some information from a file.
File:
ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 CONTENT ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 CONTENT
CONTENT
CONTENT
CONTENTCONTENTCONTENT CONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT
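I want to run several patterns against this file, but once I extract the first piece of information, the remaining searches see an empty file.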
import re
import pdb
w = open("extractfile.txt","r")
print w.read()
print re.findall(r'CONTENT', w.read())
print re.findall(r'\w{3} \d{2}-\d{2}-\d{2}', w.read())
Output:
ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 CONTENT ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 CONTENT
CONTENT
CONTENT
CONTENTCONTENTCONTENT CONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT
[]
[]
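If I change the order of the print statements, only the first one ever shows any output and the rest come back empty. Another thing I was considering is using groups to print the multi-line matches on a single line, but I don't know whether that would work.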
Answer 0 (score: 0)
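The empty lists in the question come from the file position: the first w.read() consumes the whole file, so every later w.read() returns an empty string and re.findall has nothing to search. A minimal sketch of that behaviour, written for Python 3 and using io.StringIO as a stand-in for extractfile.txt (the sample text is shortened and is an assumption for illustration):

```python
import io
import re

# Stand-in for open("extractfile.txt"); the text is a shortened, assumed sample.
sample = "ASD 23-02-34 ASD 23-02-34 CONTENT\nCONTENT\n"
w = io.StringIO(sample)

first = w.read()    # reads everything and leaves the position at the end
second = w.read()   # nothing left to read, so this is an empty string

print(repr(second))
print(re.findall(r'CONTENT', second))   # no matches in an empty string

w.seek(0)           # rewind to the start before reading again
third = w.read()
print(re.findall(r'CONTENT', third))
print(re.findall(r'\w{3} \d{2}-\d{2}-\d{2}', third))
```

Reading the file once into a variable and running every pattern against that variable avoids the need for seek(0) altogether.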
>>> import re
>>> with open('extractfile.txt', 'r') as txt:
...     file = txt.read()
...
>>> content = re.findall(r'CONTENT', file)
>>> content
['CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT']
>>> pattern = re.findall(r'(?P<asd>[\w]+ )(?P<dgt>[\d-]+)', file)
>>> pattern
[('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34')]
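The space after [\w]+ can also be excluded from the <asd> group by moving it outside the group, but this is slower because the regex ends up performing more steps: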
re.findall(r'(?P<asd>[\w]+) (?P<dgt>[\d-]+)', file)
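One caveat with both patterns: re.findall returns plain tuples, so the group names asd and dgt are not visible in the result. If the names matter, re.finditer yields match objects that can be queried by name. The sketch below (Python 3; the text string is an assumed, shortened sample, and pairs and one_line are hypothetical variable names) also joins the matches onto one line, which is one possible way to get the single-line printout mentioned in the question:

```python
import re

# Assumed, shortened sample of the file contents.
text = "ASD 23-02-34 ASD 23-02-34 CONTENT\nCONTENT\n"

# Space kept outside the named groups, as in the last pattern above.
regex = re.compile(r'(?P<asd>\w+) (?P<dgt>[\d-]+)')

pairs = []
for m in regex.finditer(text):
    # Groups are read by name instead of by tuple position.
    pairs.append((m.group('asd'), m.group('dgt')))

print(pairs)

# Join every match onto a single output line.
one_line = ", ".join("{} {}".format(a, d) for a, d in pairs)
print(one_line)
```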