Question

I am trying to parse a log file to extract email addresses. I can match each address with a regular expression and print it, but I noticed that the log file contains a couple of duplicate addresses. How can I remove the duplicates and print only the unique email addresses that match the pattern?

Here is the code I have written so far:

import sys
import re

file = open('/Users/me/Desktop/test.txt', 'r')
temp =[]
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')

    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            print pattern.group()

        else:
            print "nono"

Here is my example log file that I am trying to parse:

Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]

Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => someuser@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => me@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => wo@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => lol@somedomain.com R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h Completed

Also, I am curious if I can improve my program or the regex. Any suggestion would be very helpful.

Thanks in advance.

Answer 1

As danidee said (he was first), a set will do the trick.

Try this:

from __future__ import print_function

import re

with open('test.txt') as f:
    data = f.read().splitlines()

emails = set(re.sub(r'^.*\s+(\w+\@[^\s]*?)\s+.*', r'\1', line) for line in data if '@' in line)

# Sort before printing: sets are unordered, so the order is otherwise arbitrary.
print('\n'.join(sorted(emails)) if emails else 'nono')

Output:

lol@somedomain.com
me@somedomain.com
someuser@somedomain.com
wo@somedomain.com

PS: You may want to use a proper email regex check, because the one I used here is very primitive.
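
To illustrate the PS above, here is a minimal sketch of a somewhat stricter check. The names EMAIL_RE and unique_emails are hypothetical, and the pattern is only a pragmatic approximation of an email address, not a full RFC 5322 validator.

```python
import re

# Pragmatic approximation of an email address (an assumption for this sketch);
# it is not a complete RFC 5322 validator.
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')


def unique_emails(lines):
    """Return the set of distinct addresses found in an iterable of lines."""
    found = set()
    for line in lines:
        # findall returns whole matches here because the only group is non-capturing
        found.update(EMAIL_RE.findall(line))
    return found


# Shortened versions of the log lines from the question
sample = [
    "1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail T=remote_smtp",
    "1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail T=remote_smtp",
    "1Wuniq-mail-idSm-1h => me@somedomain.com R=mail T=pop_mail_net",
    "1Wuniq-mail-idSm-1h Completed",
]

for email in sorted(unique_emails(sample)):
    print(email)
```

Because the pattern is searched for anywhere in the line, it does not depend on the '->' or '=>' markers at all.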

Answer 2

You can use a set to keep track of the addresses you have already printed. Each time you match an email address, check whether it is already in the set; if it is not, print it and add it to the set:

import sys
import re

file = open('/Users/me/Desktop/test.txt', 'r')
temp =[]
seen = set()
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')

    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            matched = pattern.group()
            if matched not in seen:
                # Record the address so later repeats are not printed again.
                seen.add(matched)
                print(matched)

        else:
            print("nono")
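
The check-then-print idea in this answer only works if every printed address is also recorded in the set. As a minimal sketch, the same logic can be pulled out into a small generator so that it can be tried without a log file. The helper name unique_in_order and the sample list are hypothetical.

```python
def unique_in_order(items):
    """Yield each item the first time it appears, skipping later repeats."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)  # remember it so repeats are skipped
            yield item


# Hypothetical list of matches, in the order they were found
matches = [
    "someuser@somedomain.com",
    "someuser@somedomain.com",
    "me@somedomain.com",
    "wo@somedomain.com",
    "me@somedomain.com",
]

for email in unique_in_order(matches):
    print(email)
```

Unlike collecting everything into a set and printing it at the end, this keeps the addresses in the order in which they first appear in the log.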

Answer 3

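Some of the duplicates are due to a bug in your code: you do not reset temp when processing each line. A line that contains neither '->' nor '=>', but follows a line that does contain one of them, will still pass the "if temp:" test and output the email address from the previous line (if there was one).

This can be fixed by using continue to jump back to the start of the loop when the line contains neither '->' nor '=>'.

For the other, genuine duplicates, which occur because the same email address appears on several lines, you can use a set to filter them out.

import re

addresses = set()
pattern = re.compile(r'^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?')

with open('/Users/me/Desktop/test.txt', 'r') as f:
    for line in f:
        if '->' in line:
            temp = line.split('->')
        elif '=>' in line:
            temp = line.split('=>')
        else:
            # neither '=>' nor '->' present in the line
            continue

        match = pattern.match(temp[1])
        if match is not None:
            # strip the spaces that the pattern matches around the address
            addresses.add(match.group().strip())
        else:
            print("nono")

for address in sorted(addresses):
    print(address)

The addresses are stored in a set to remove duplicates; they are then sorted and printed. Note also the use of the with statement to open the file in a context manager. This guarantees that the file is always closed.

Also, because you will be applying the same regex pattern many times, it is worth compiling it up front for efficiency.

With a properly written regex pattern, your code can be greatly simplified.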
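
As one possible illustration of that simplification (a sketch under stated assumptions, not the answerer's original code), a single pattern can locate the delivery marker and capture the address in one step, so no splitting is needed. The names DELIVERY_RE and extract_addresses are hypothetical, and the pattern assumes the address is the first whitespace-free token after '->' or '=>'.

```python
import re

# Assumed pattern: a '->' or '=>' delivery marker, whitespace, then a
# whitespace-free token containing '@'. A simplification, not a validator.
DELIVERY_RE = re.compile(r'[-=]>\s+(\S+@\S+)')


def extract_addresses(lines):
    """Collect the distinct recipient addresses from delivery lines."""
    addresses = set()
    for line in lines:
        match = DELIVERY_RE.search(line)
        if match:  # lines without a delivery marker are simply skipped
            addresses.add(match.group(1))
    return addresses


# Shortened versions of the log lines from the question
log_lines = [
    "exim[5660]: 1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail",
    "exim[5660]: 1Wuniq-mail-idSo-Fg -> someuser@somedomain.com R=mail",
    "exim[5661]: 1Wuniq-mail-idSm-1h => me@somedomain.com R=mail",
    "exim[5661]: 1Wuniq-mail-idSm-1h => wo@somedomain.com R=mail",
    "exim[5661]: 1Wuniq-mail-idSm-1h Completed",
]

for address in sorted(extract_addresses(log_lines)):
    print(address)
```

To read from a real log, an open file object can be passed in place of the list, since the function only iterates over lines.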

Sorting the unique values from regex match in python
