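I want to compare the email addresses in two files, find the addresses that appear in both, and report the date each one was sent. I have two files: 1) maillog.txt (a postfix maillog) and 2) testmail.txt (email addresses separated by newlines). I use re
to extract the email address and the send date from maillog.txt, which looks like this: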
Nov 3 10:08:43 server postfix/smtp[150754]: 78FA8209EDEF: to=<adamson@example.com>, relay=aspmx.l.google.com[74.125.24.26]:25, delay=3.2, delays=0.1/0/1.6/1.5, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718076 m11si5060862pls.447 - gsmtp)
Nov 3 10:10:45 server postfix/smtp[150754]: 7C42A209EDEF: to=<addison@linux.com>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.152.217]:25, delay=5.4, delays=0.1/0/3.8/1.5, dsn=2.0.0, status=sent (250 2.0.0 2dvkvt5tgc-1 Message accepted for delivery)
Nov 3 10:15:45 server postfix/smtp[150754]: 83533209EDE8: to=<johndoe@carchcoal.com>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.144.222]:25, delay=4.8, delays=0.1/0/3.3/1.5, dsn=2.0.0, status=sent (250 2.0.0 2dvm8yww64-1 Message accepted for delivery)
Nov 3 10:16:42 server postfix/smtp[150754]: 83A5E209EDEF: to=<jackn@alphanr.com>, relay=aspmx.l.google.com[74.125.200.27]:25, delay=1.6, delays=0.1/0/0.82/0.69, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718555 j186si6198120pgc.455 - gsmtp)
Nov 3 10:17:44 server postfix/smtp[150754]: 8636D209EDEF: to=<sbins@archcoal.com>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.144.222]:25, delay=4.1, delays=0.11/0/2.6/1.4, dsn=2.0.0, status=sent (250 2.0.0 2dvm8ywwdh-1 Message accepted for delivery)
Nov 3 10:18:42 server postfix/smtp[150754]: 8A014209EDEF: to=<leo@adalphanr.com>, relay=aspmx.l.google.com[74.125.200.27]:25, delay=1.9, delays=0.1/0/0.72/1.1, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718675 o2si6032950pgp.46 - gsmtp)
Here is my other file, testmail.txt:
adamson@example.com
jdswson@gmail.com
Below is what I have tried. It works, but I would like to know whether there is a more efficient way to handle a large number of maillogs and email addresses:
import re
pattern=r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'
with open("testmail.txt") as fh1:
    for addr in fh1:
        if addr:
            # the whole log is re-read once per address
            with open("maillog.txt") as fh:
                for line in fh:
                    if line:
                        match=re.finditer(pattern,line)
                        for obj in match:
                            addr=addr.strip()
                            addr2=obj.group('email').strip()
                            if addr == addr2:
                                print(obj.groupdict())
If a match is found, this prints:
{'month': 'Nov', 'day': '3', 'ts': '10:08:43', 'email': 'adamson@example.com'}
Answer 0 (score: 1)
Here is my solution:
In [1]: import re
In [2]: pat = r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'
In [3]: emails = set()
In [4]: date_email = {}
In [6]: with open('maillog.txt', mode='r') as f:
   ...:     for line in f:
   ...:         m = re.search(pat, line)
   ...:         if m:  # skip log lines that have no to=<...> field
   ...:             date_email[m.group('email')] = m.group('month', 'day', 'ts')
   ...:
In [7]: date_email
Out[7]:
{'adamson@example.com': ('Nov', '3', '10:08:43'),
'addison@linux.com': ('Nov', '3', '10:10:45'),
'jackn@alphanr.com': ('Nov', '3', '10:16:42'),
'johndoe@carchcoal.com': ('Nov', '3', '10:15:45'),
'leo@adalphanr.com': ('Nov', '3', '10:18:42'),
'sbins@archcoal.com': ('Nov', '3', '10:17:44')}
In [11]: with open('testmail.txt', mode='r') as f:
    ...:     for line in f:
    ...:         emails.add(line.strip())
    ...:
In [12]: emails
Out[12]: {'adamson@example.com', 'jdswson@gmail.com'}
In [15]: for email in emails:
    ...:     if email in date_email:
    ...:         print(email, date_email[email])
    ...:
('adamson@example.com', ('Nov', '3', '10:08:43'))
You can format the output however you like.
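As one possible way to format the result, the sketch below turns an address and its (month, day, ts) tuple into a single readable line. The function name format_entry is hypothetical and is not part of the answer; the sample entry is copied from the date_email dict shown in the session.

```python
def format_entry(email, stamp):
    """Return one readable line for an address and its (month, day, ts) tuple."""
    month, day, ts = stamp
    return "{}: sent on {} {} at {}".format(email, month, day, ts)


# Sample entry taken from the date_email dict in the session
print(format_entry('adamson@example.com', ('Nov', '3', '10:08:43')))
```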
Several open() calls can be combined in a single "with" statement like this:
with open(file1, mode='r') as f1, open(file2, mode='r') as f2:
    # do something with f1
    # do something with f2
    pass
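Answer 1 (score: 1)
You can try a regular expression and capture the groups.
Let's work through the solution in three steps.
First, capture all the email addresses from emails.txt: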
emails=[]
with open('emails.txt','r') as f:
    for line in f:
        emails.append(re.search(email_pattern,line).group())
Second, capture the required data from data.txt:
with open('data.txt','r') as f:
    month_day=[[find.group(4) if find.group(4) is not None else [find.group(1), find.group(2), find.group(3)] for find in re.finditer(pattern,line)] for line in f]
Third, now that we have all the data, check whether each email is in our list and, if it is, add that group's information to a dict:
for item in month_day:
    final_dict = {}
    # item is [[month, day, ts], email]; skip lines where either part is missing
    if len(item) == 2 and item[1] in emails:
        final_dict['month'] = item[0][0]
        final_dict['day'] = item[0][1]
        final_dict['ts'] = item[0][2]
        final_dict['email'] = item[1]
    if final_dict:
        print(final_dict)
Full code:
import re
# \s+(\d{1,2}) accepts one or more spaces before the day and two-digit days
pattern=r'^(\w{0,3})\s+(\d{1,2})\s(\d.+?\s)|<(\w+[@]\w+[.]\w+)>'
email_pattern=r'\w+[@]\w+[.]\w+'
emails=[]
with open('emails.txt','r') as f:
    for line in f:
        emails.append(re.search(email_pattern,line).group())
with open('data.txt','r') as f:
    month_day=[[find.group(4) if find.group(4) is not None else [find.group(1), find.group(2), find.group(3)] for find in re.finditer(pattern,line)] for line in f]
for item in month_day:
    final_dict = {}
    # item is [[month, day, ts], email]; skip lines where either part is missing
    if len(item) == 2 and item[1] in emails:
        final_dict['month'] = item[0][0]
        final_dict['day'] = item[0][1]
        final_dict['ts'] = item[0][2]
        final_dict['email'] = item[1]
    if final_dict:
        print(final_dict)
Output:
{'ts': '10:08:43 ', 'month': 'Nov', 'email': 'adamson@example.com', 'day': '3'}
Regex information:
^ asserts position at start of a line
\w{0,3} matches zero to three word characters (a word character is [a-zA-Z0-9_])
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\d matches a digit (equal to [0-9])
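To make the tokens listed above concrete, here is a small sketch that applies each one on its own to the start of a log line. The sample string is shortened from the log in the question; the variable names are only for illustration.

```python
import re

sample = "Nov 3 10:08:43 server"

# ^ anchors at the start, \w{0,3} takes up to three word characters
month = re.match(r"^\w{0,3}", sample).group()

# \s matches each whitespace character separately
spaces = re.findall(r"\s", sample)

# \d matches a single digit; search() returns the first one it finds
first_digit = re.search(r"\d", sample).group()

print(month, len(spaces), first_digit)
```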
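Answer 2 (score: 1)
Quick and untested, but conceptually simple enough: write one big whoppin' regex with all the addresses.
import re
with open("testmail.txt") as fh1:
    emails = []
    for addr in fh1:
        addr = addr.strip()
        if addr:  # a blank line would add an empty alternative that matches anything
            emails.append(re.escape(addr))
pattern=re.compile(
    # the closing > stops an address from matching as a prefix of a longer one
    r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>%s)>' %
    '|'.join(emails))
with open("maillog.txt") as fh:
    for line in fh:
        for match in pattern.finditer(line):
            print(match.groupdict())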
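The role of re.escape in this approach may be easier to see on in-memory data. The sketch below is only an illustration: build_pattern is a hypothetical helper, and the second log fragment is a made-up look-alike address used to show what escaping prevents.

```python
import re


def build_pattern(addresses):
    """Compile one regex that matches any of the given addresses in a to=<...> field."""
    escaped = [re.escape(a) for a in addresses if a]
    return re.compile(r'to=<(?P<email>%s)>' % '|'.join(escaped))


rx = build_pattern(['adamson@example.com'])

hit = rx.search('to=<adamson@example.com>, relay=aspmx.l.google.com[74.125.24.26]:25')

# Without re.escape the "." would match any character, so this made-up
# look-alike address would be accepted by mistake
miss = rx.search('to=<adamson@exampleXcom>, relay=aspmx.l.google.com[74.125.24.26]:25')

print(hit.group('email'), miss)
```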
Answer 3 (score: 1)
My suggestion is to store all the emails from testmail.txt in a set, compile the regex, then iterate over the lines of maillog.txt and look up each address in the set. That way only the shorter file has to reside in memory, the regex pattern is compiled only once, and the lookup is done in a set, which is optimized for that kind of access:
import re
pattern=r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'
# load the testmail file into a set
mails = set()
with open('testmail.txt') as fd:
    for line in fd:
        mails.add(line.strip())
# compile the regex once
rx = re.compile(pattern)
# process the maillog file:
with open('maillog.txt') as fd:
    for line in fd:
        m = rx.match(line)
        if m is not None and m.groupdict()['email'] in mails:
            print(m.groupdict())
The output for the sample data is as expected:
{'month': 'Nov', 'day': '3', 'ts': '10:08:43', 'email': 'adamson@example.com'}
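Since the goal is the send date, the captured strings can be combined into a datetime object. The sketch below is one way to do that under stated assumptions: to_datetime is a hypothetical helper, the year must be supplied by the caller because syslog-style timestamps omit it (2017 here is an assumed value), and %b expects English month abbreviations.

```python
from datetime import datetime


def to_datetime(fields, year):
    """Combine the month/day/ts strings from groupdict() into a datetime.

    The year is passed in because the log timestamp does not contain one.
    """
    text = "{} {} {} {}".format(year, fields['month'], fields['day'], fields['ts'])
    # %b parses the abbreviated month name and assumes an English/C locale
    return datetime.strptime(text, "%Y %b %d %H:%M:%S")


match = {'month': 'Nov', 'day': '3', 'ts': '10:08:43', 'email': 'adamson@example.com'}
print(to_datetime(match, 2017))  # 2017 is an assumed year
```

Having real datetime values makes it straightforward to sort matches or filter them by a time window.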