Read only the lines that contain a specific string and apply a regex to them

Time: 2013-04-18 14:01:16

Tags: python regex readfile readlines

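Here is my code. I have a script that reads a file, but not all lines in the file have the same format, and I only want to extract information from the lines that contain I Doc O:.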

I tried using an if condition, but it still fails when there are lines that the regex does not match:

#!/usr/bin/env python 

# -*- coding: utf-8 -*-

import re 

def extraire(data):
    ms = re.match(r'(\S+).*?(O:\S+).*(R:\S+).*mid:(\d+)', data) # heure & mid 
    return {'Heure':ms.group(1), 'mid':ms.group(2),"Origine":ms.group(3),"Destination":ms.group(4)}

tableau = []  

fichier = open("/home/TEST/file.log")
f = fichier.readlines() 
for line in f: 
    if (re.findall(".*I Doc O:.*",line)):     
        tableau = [extraire(line) for line in f ]

print tableau
fichier.close()

Here is a sample of some lines from my file; I want the first and fourth lines:

01:09:25.258 mta         Messages       I Doc O:NVS:SMTP/alarm@yyy.xx R:NVS:SMS/+654811 mid:6261
01:09:41.965 mta         Messages       I Rep O:NVS:SMTP/alarmes.techniques@xxx.de R:NVS:SMS/+455451 mid:6261
01:09:41.965 mta         Messages       I Rep 6261 OK, Accepted (ID: 26)
08:14:14.469 mta         Messages       I Doc O:NVS:SMTP/alarm@xxxx.en R:NVS:SMS/+654646 mid:6262
08:14:30.630 mta         Messages       I Rep O:NVS:SMTP/alarm@azea.er R:NVS:SMS/+33688704859 mid:6262
08:14:30.630 mta         Messages       I Rep 6262 OK, Accepted (ID: 28)

1 answer:

Answer 0: (score: 0)

From: http://docs.python.org/2/library/re.html

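*?, +?, ??     The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched...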
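To make the quoted passage concrete, here is a minimal sketch of greedy versus non-greedy matching. It uses the `<.*>` pattern mentioned in the quote; the sample string `<a> b <c>` is an assumption chosen for illustration.

```python
import re

text = "<a> b <c>"  # sample string, assumed for illustration

# Greedy: '.*' takes as much text as it can, so the match runs to the last '>'.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: '.*?' takes as little as it can, so the match stops at the first '>'.
lazy = re.match(r"<.*?>", text).group()

print(greedy)  # <a> b <c>
print(lazy)    # <a>
```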

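Also, findall is best used on the whole buffer and returns a list, so looping over the matches saves you from having to test every line of the file with a condition.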

buff = fichier.read()
matches = re.findall(".*?I Doc O:.*", buff)
for match in matches:
    tableau = ...

- Here is my test code; can you tell me what it is doing that you don't want?

>>> import re
>>> a = """
... 01:09:25.258 mta         Messages       I Doc O:NVS:SMTP/alarm@yyy.xx R:NVS:SMS/+654811 mid:6261
... 01:09:41.965 mta         Messages       I Rep O:NVS:SMTP/alarmes.techniques@xxx.de R:NVS:SMS/+455451 mid:6261
... 01:09:41.965 mta         Messages       I Rep 6261 OK, Accepted (ID: 26)
... 08:14:14.469 mta         Messages       I Doc O:NVS:SMTP/alarm@xxxx.en R:NVS:SMS/+654646 mid:6262
... 08:14:30.630 mta         Messages       I Rep O:NVS:SMTP/alarm@azea.er R:NVS:SMS/+33688704859 mid:6262
... 08:14:30.630 mta         Messages       I Rep 6262 OK, Accepted (ID: 28)"""
>>> m = re.findall(".*?I Doc O:.*",a)
>>> m
['01:09:25.258 mta         Messages       I Doc O:NVS:SMTP/alarm@yyy.xx R:NVS:SMS/+654811 mid:6261', '08:14:14.469 mta         Messages       I Doc O:NVS:SMTP/alarm@xxxx.en R:NVS:SMS/+654646 mid:6262']

>>> tableau = []
>>> for line in m:
...     tableau.append( extraire(line) )
... 
>>> tableau
[{'Origine': 'R:NVS:SMS/+654811', 'Destination': '6261', 'Heure': '01:09:25.258', 'mid': 'O:NVS:SMTP/alarm@yyy.xx'}, {'Origine': 'R:NVS:SMS/+654646', 'Destination': '6262', 'Heure': '08:14:14.469', 'mid': 'O:NVS:SMTP/alarm@xxxx.en'}]
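In the output above, the labels look shifted relative to the captured values (for example, 'mid' holds the O: address). A possible variant, sketched below, uses named groups so each value carries its own label, and returns None for lines that do not match instead of failing on `ms.group`. The names `PATTERN` and `extraire_safe` are hypothetical, and the sketch assumes that O: is the origin and R: is the destination.

```python
import re

# Similar in shape to the pattern in the question, but with named groups.
# Assumption: O: marks the origin and R: marks the destination.
PATTERN = re.compile(
    r"(?P<heure>\S+).*?I Doc (?P<origine>O:\S+)\s+"
    r"(?P<destination>R:\S+)\s+mid:(?P<mid>\d+)"
)

def extraire_safe(line):
    """Return a dict for an 'I Doc O:' line, or None for any other line."""
    ms = PATTERN.match(line)
    if ms is None:  # re.match returns None when the line does not match
        return None
    return {
        "Heure": ms.group("heure"),
        "Origine": ms.group("origine"),
        "Destination": ms.group("destination"),
        "mid": ms.group("mid"),
    }

lines = [
    "01:09:25.258 mta         Messages       I Doc O:NVS:SMTP/alarm@yyy.xx R:NVS:SMS/+654811 mid:6261",
    "01:09:41.965 mta         Messages       I Rep 6261 OK, Accepted (ID: 26)",
]

# Keep only the lines that produced a dict.
tableau = [d for d in (extraire_safe(l) for l in lines) if d is not None]
print(tableau)
```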

You can also do this in one line:

>>> tableau = [ extraire(line) for line in re.findall( ".*?I Doc O:.*", fichier.read() ) ]
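For large log files, reading the whole buffer may not be desirable. The following is a minimal sketch of a line-by-line alternative: a plain substring test selects the `I Doc O:` lines before the regex from the question is applied. The helper name `doc_lines` is hypothetical; it accepts any iterable of lines, so an open file object works as well as a list.

```python
import re

# Pattern taken from the question.
DOC_LINE = re.compile(r"(\S+).*?(O:\S+).*(R:\S+).*mid:(\d+)")

def doc_lines(lines):
    """Yield the captured groups of every line that contains 'I Doc O:'."""
    for line in lines:
        if "I Doc O:" not in line:  # cheap substring test before the regex
            continue
        ms = DOC_LINE.match(line)
        if ms is not None:          # skip lines the regex does not match
            yield ms.groups()

sample = [
    "01:09:25.258 mta         Messages       I Doc O:NVS:SMTP/alarm@yyy.xx R:NVS:SMS/+654811 mid:6261\n",
    "01:09:41.965 mta         Messages       I Rep 6261 OK, Accepted (ID: 26)\n",
]
result = list(doc_lines(sample))
print(result)
```

With a real file, the same helper could be called as `with open("/home/TEST/file.log") as fichier: tableau = list(doc_lines(fichier))`, which also closes the file automatically.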