Question

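Experts, I'm trying to count the email addresses in a maillog file and the number of times each one occurs. I can already do this with a regular expression via re.search or re.match, but I'd like to accomplish it with re.findall, which I'm currently looking into. Any suggestions would be much appreciated.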

1) The code:

# cat maillcount31.py
#!/usr/bin/python
import re
#count = 0
mydic = {}
counts = mydic
fmt = " %-32s %-15s"
log =  open('kkmail', 'r')

for line in log.readlines():
        myre = re.search('.*from=<(.*)>,\ssize', line)
        if myre:
           name = myre.group(1)
           if name not in mydic.keys():
              mydic[name] = 0
           mydic[name] +=1

for key in counts:
   print  fmt % (key, counts[key])

2) Output from the current code:

# python maillcount31.py
 root@MyServer1.myinc.com         13
 User01@MyServer1.myinc.com       14

Answer 1

Hope this helps...

import re
from collections import Counter

# Modify the regex to fit your file or line pattern. If you call findall()
# on each line, the returned lists have to be combined.
emails = re.findall(r'.*from=<(.*)>,\ssize', line)
result = Counter(emails)  # type is <class 'collections.Counter'>
dict(result)  # convert to a regular dict

re.findall() returns a list. See "How can I count the occurrences of a list item in Python?" for other ways to count the items in the returned list.
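As a minimal sketch of the re.findall() approach, assuming the whole log fits in memory: read the text once and let findall() return every captured sender. The sample log lines and the name count_senders are made up for illustration, and the pattern uses [^>]* rather than .* so a match cannot run past the closing bracket; adjust it to your real log format. This is written for Python 3.

```python
import re
from collections import Counter

# Hypothetical sample in the style of the question's log; real lines may differ.
SAMPLE_LOG = """\
Jan  1 10:00:01 MyServer1 postfix/qmgr[111]: A1: from=<root@MyServer1.myinc.com>, size=512, nrcpt=1
Jan  1 10:00:02 MyServer1 postfix/qmgr[111]: A2: from=<User01@MyServer1.myinc.com>, size=734, nrcpt=1
Jan  1 10:00:03 MyServer1 postfix/smtp[222]: A1: to=<someone@example.com>, status=sent
Jan  1 10:00:04 MyServer1 postfix/qmgr[111]: A3: from=<root@MyServer1.myinc.com>, size=290, nrcpt=2
"""

# Capture whatever sits between "from=<" and ">, size"
SENDER_RE = re.compile(r'from=<([^>]*)>,\ssize')

def count_senders(text):
    """Return a Counter mapping each sender address to its number of occurrences."""
    return Counter(SENDER_RE.findall(text))

counts = count_senders(SAMPLE_LOG)
for address, n in counts.items():
    print(" %-32s %-15s" % (address, n))
```

With a real file, the text would come from something like open('kkmail').read() instead of SAMPLE_LOG.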

By the way, here is a neat feature of Counter:

>>> tmp1 = Counter(re.findall('from=<([^\s]*)>', "from=<usr1@gmail.com>, from=<usr2@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>,") )
>>> tmp1
Counter({'usr1@gmail.com': 4, 'usr2@gmail.com': 1})
>>> tmp2 = Counter(re.findall('from=<([^\s]*)>', "from=<usr2@gmail.com>, from=<usr3@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>,") )
>>> dict(tmp1+tmp2)
{'usr2@gmail.com': 2, 'usr1@gmail.com': 7, 'usr3@gmail.com': 1}

So if the file is very large, we can count each line separately and combine the results with Counter.
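One way to sketch that line-by-line idea, assuming Python 3: keep a single running Counter and feed it the matches from each line with update(), so the file never has to be held in memory. The function name count_senders_by_line and the demo lines are invented for illustration.

```python
import re
from collections import Counter

SENDER_RE = re.compile(r'from=<([^>]*)>,\ssize')

def count_senders_by_line(lines):
    """Accumulate sender counts one line at a time."""
    total = Counter()
    for line in lines:
        # update() adds the counts of this line's matches to the running total
        total.update(SENDER_RE.findall(line))
    return total

# Any iterable of lines works; with a real file, pass the open file object:
#     with open('kkmail') as log:
#         total = count_senders_by_line(log)
demo_lines = [
    "A1: from=<usr1@gmail.com>, size=100\n",
    "A2: from=<usr2@gmail.com>, size=200\n",
    "A3: from=<usr1@gmail.com>, size=300\n",
]
total = count_senders_by_line(demo_lines)
print(total.most_common())
```

Using update() on one Counter avoids building a new Counter per line and adding them together, though both give the same totals.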

Answer 2

Have you considered using pandas? It can give you a nice table of results without any regex commands for the counting step.

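 import pandas as pd

 # email_list is assumed to hold the addresses already extracted from the log
 emails = pd.Series(email_list)
 individual_emails = emails.unique()

 # make a table with the emails and a zeroed tally
 tally = pd.DataFrame({'email': individual_emails,
                       'count': [0] * len(individual_emails)},
                      columns=['email', 'count'])

 for item in tally.index:
      address = tally.iloc[item, 0]
      # number of entries in the series equal to this address
      total = len(emails[emails == address])
      tally.iloc[item, 1] = total

 print(tally)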

Answer 3

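I hope the code at the bottom helps.

In general, though, keep these three points in mind:

Use (with) when opening files
Use iteritems() when iterating over a dictionary (items() in Python 3)
When working with containers, collections is your best friend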

#!/usr/bin/python
import re
from collections import Counter

fmt = " %-32s %-15s"
filename = 'kkmail'

# Extract the email addresses
email_list = []
with open(filename, 'r') as log:
   for line in log.readlines():
      _re = re.search(r'.*from=<(.*)>,\ssize', line)
      if _re:
         name = _re.group(1)
         email_list.append(name)

# Count the email addresses
counts = dict(Counter(email_list)) # List to dict of counts: {'a':3, 'b':7,...}
for key, val in counts.iteritems():
   print  fmt % (key, val)
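The script above targets Python 2 (print statement, iteritems()). A rough Python 3 equivalent that applies the same three tips might look like the sketch below. The helper name count_from_file and the throwaway demo file with its contents are assumptions for illustration, not part of the original answer.

```python
import os
import re
import tempfile
from collections import Counter

FMT = " %-32s %-15s"

def count_from_file(filename):
    """Count sender addresses in a log file, reading it line by line."""
    counts = Counter()
    with open(filename, 'r') as log:          # tip 1: `with` closes the file
        for line in log:
            match = re.search(r'from=<(.*)>,\ssize', line)
            if match:
                counts[match.group(1)] += 1   # tip 3: missing keys count as 0
    return counts

# Demo on a temporary file with invented content
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write("x: from=<a@example.com>, size=1\n"
              "y: from=<b@example.com>, size=2\n"
              "z: from=<a@example.com>, size=3\n")
counts = count_from_file(tmp.name)
os.remove(tmp.name)

for key, val in counts.items():               # tip 2: items() replaces iteritems()
    print(FMT % (key, val))
```

Incrementing the Counter directly skips the intermediate email_list, which matters only for large logs.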
