Python: Matching multiple patterns in a file

Time: 2018-10-10 16:57:50

Tags: python regex

I am trying to extract some information from a file.

File:

ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34  CONTENT ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34  CONTENT

CONTENT
CONTENT
CONTENTCONTENTCONTENT CONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT

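I want to run several patterns against this file, but once I extract the first piece of information, the rest (the file) comes back empty.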

import re
import pdb

w = open("extractfile.txt","r")

print w.read()
print re.findall(r'CONTENT', w.read())
print re.findall(r'\w{3} \d{2}-\d{2}-\d{2}', w.read())

Output:

ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34  CONTENT ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34 ASD 23-02-34  CONTENT

CONTENT
CONTENT
CONTENTCONTENTCONTENT CONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT
CONTENTCONTENTCONTENTCONTENTCONTENT



[]
[]

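If I change the order of the print statements, the first one always shows its result and the rest come out empty. Another idea I had was to use groups to print multiple lines as one, but I don't know whether that would work.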

1 Answer:

Answer 0: (score: 0)

>>> import re
>>> with open('extractfile.txt', 'r') as txt:
...     file = txt.read()

>>> content = re.findall(r'CONTENT', file)
>>> content
['CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT', 'CONTENT']
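
The reason this works where the original code did not: a file object keeps track of its current position, and `read()` with no argument consumes everything up to the end of the file. Any later `read()` on the same object then returns an empty string, so `re.findall` has nothing to search. The sketch below reproduces that behaviour with a small temporary file; the sample text and the variable names are made up for illustration and only stand in for `extractfile.txt`.

```python
import os
import re
import tempfile

# Hypothetical sample data standing in for extractfile.txt.
sample = "ASD 23-02-34 ASD 23-02-34  CONTENT\nCONTENT\nCONTENTCONTENT\n"

# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

with open(path, "r") as f:
    first = f.read()    # reads everything; the position is now at end of file
    second = f.read()   # nothing left to read, so this is an empty string
    f.seek(0)           # rewind to the start of the file
    third = f.read()    # the full text again

os.remove(path)

# Reading once and reusing the string avoids the problem altogether.
contents = re.findall(r'CONTENT', first)
dates = re.findall(r'\w{3} \d{2}-\d{2}-\d{2}', first)

print(repr(second))
print(third == first)
print(contents)
print(dates)
```

Calling `seek(0)` before each `read()` would also work, but reading the file once into a string, as the answer does, is usually simpler.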

>>> pattern = re.findall(r'(?P<asd>[\w]+ )(?P<dgt>[\d-]+)', file)
>>> pattern
[('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34'), ('ASD ', '23-02-34')]

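The space after [\w]+ can also be excluded from the <asd> group by moving it outside the group, but that is slower, because the regex ends up performing more steps.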

re.findall(r'(?P<asd>[\w]+) (?P<dgt>[\d-]+)', file)
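
To make the difference between the two patterns concrete, here is a minimal sketch run against a one-line sample in the same format as the file. The sample string and the variable names are assumptions for illustration. It only compares what each pattern captures and does not measure speed.

```python
import re

# Hypothetical one-line sample in the same format as the file above.
text = "ASD 23-02-34 ASD 23-02-34  CONTENT"

# Space inside the first group: it becomes part of the captured text.
with_space = re.findall(r'(?P<asd>[\w]+ )(?P<dgt>[\d-]+)', text)

# Space outside the group: the captured text has no trailing space.
without_space = re.findall(r'(?P<asd>[\w]+) (?P<dgt>[\d-]+)', text)

# re.findall returns plain tuples; to use the group names, iterate
# over match objects with re.finditer instead.
pairs = [(m.group('asd'), m.group('dgt'))
         for m in re.finditer(r'(?P<asd>[\w]+) (?P<dgt>[\d-]+)', text)]

print(with_space)
print(without_space)
print(pairs)
```

With the space outside the group there is no need to strip each captured label afterwards.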