Using a dictionary to count email addresses in a maillog file in Python

Time: 2015-12-23 20:21:17

Tags: python

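Experts, I am trying to count the email addresses in a maillog file and how many times each one is repeated. I have managed to do it with a regular expression using re.search (or re.match), but I would like to see it done with re.findall, which is what I am looking into at the moment. Any suggestions would be much appreciated.

1) The code: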

# cat maillcount31.py
#!/usr/bin/python
import re
#count = 0
mydic = {}
counts = mydic
fmt = " %-32s %-15s"
log = open('kkmail', 'r')

for line in log.readlines():
    myre = re.search(r'.*from=<(.*)>,\ssize', line)
    if myre:
        name = myre.group(1)
        if name not in mydic.keys():
            mydic[name] = 0
        mydic[name] += 1

for key in counts:
    print(fmt % (key, counts[key]))

2) Output from the current code:

# python maillcount31.py
 root@MyServer1.myinc.com         13
 User01@MyServer1.myinc.com       14

3 answers:

Answer 0: (score: 2)

Hope this helps...

import re
from collections import Counter

# `line` is one line (or the whole text) read from the log. Adjust the pattern to your file or line format.
# If you call findall() on each line, combine the returned lists before counting.
emails = re.findall(r'.*from=<(.*)>,\ssize', line)
result = Counter(emails)  # type is <class 'collections.Counter'>
dict(result)  # convert to a regular dict

re.findall() returns a list. See "How can I count the occurrences of a list item in Python?" for other ways to count the items in the returned list.

By the way, here is an interesting feature of Counter:
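As a rough illustration of those other ways, here is a minimal Python 3 sketch that counts the list returned by re.findall() with a plain dict and with a defaultdict. The log lines and the example.com addresses are made-up sample data, and the pattern is a slightly tightened variant of the one in the question, so adjust both to your real file.

```python
import re
from collections import defaultdict

# Hypothetical sample text standing in for the contents of the log file.
sample = (
    "Dec 23 10:00:01 host postfix/qmgr[101]: A1: from=<root@example.com>, size=512, nrcpt=1\n"
    "Dec 23 10:00:02 host postfix/qmgr[101]: A2: from=<user01@example.com>, size=640, nrcpt=1\n"
    "Dec 23 10:00:03 host postfix/qmgr[101]: A3: from=<root@example.com>, size=480, nrcpt=1\n"
)

# With a single capture group, findall() returns a list of the captured strings.
emails = re.findall(r'from=<([^>]*)>,\ssize', sample)

# Option 1: a plain dict, with get() supplying 0 for addresses not seen yet.
plain_counts = {}
for address in emails:
    plain_counts[address] = plain_counts.get(address, 0) + 1

# Option 2: a defaultdict(int), which creates missing keys with the value 0.
default_counts = defaultdict(int)
for address in emails:
    default_counts[address] += 1

print(emails)
print(plain_counts)
```

Both options produce the same tallies as Counter; Counter simply saves writing the loop.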

>>> tmp1 = Counter(re.findall('from=<([^\s]*)>', "from=<usr1@gmail.com>, from=<usr2@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>,") )
>>> tmp1
Counter({'usr1@gmail.com': 4, 'usr2@gmail.com': 1})
>>> tmp2 = Counter(re.findall('from=<([^\s]*)>', "from=<usr2@gmail.com>, from=<usr3@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>, from=<usr1@gmail.com>,") )
>>> dict(tmp1+tmp2)
{'usr2@gmail.com': 2, 'usr1@gmail.com': 7, 'usr3@gmail.com': 1}

So if the file is very large, we can count each line separately and combine the per-line results with Counter.
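One way this line-by-line approach might look in Python 3 is sketched below. It uses Counter.update(), the in-place counterpart of adding two Counters, so only one line is held in memory at a time. The function name count_senders, the sample lines and the example.com addresses are assumptions made for illustration; with a real log you would pass the open file object instead of the StringIO.

```python
import io
import re
from collections import Counter

# Slightly tightened variant of the pattern from the question.
FROM_RE = re.compile(r'from=<([^>]*)>,\ssize')

def count_senders(lines):
    """Tally sender addresses from any iterable of log lines."""
    totals = Counter()
    for line in lines:
        # update() adds one count for each address found on this line.
        totals.update(FROM_RE.findall(line))
    return totals

# Hypothetical sample data; a real run would use: with open('kkmail') as log: ...
sample_log = io.StringIO(
    "Dec 23 10:00:01 host postfix/qmgr[101]: A1: from=<root@example.com>, size=512, nrcpt=1\n"
    "Dec 23 10:00:01 host postfix/local[102]: A1: to=<admin@example.com>, relay=local, status=sent\n"
    "Dec 23 10:00:02 host postfix/qmgr[101]: A2: from=<user01@example.com>, size=640, nrcpt=1\n"
    "Dec 23 10:00:03 host postfix/qmgr[101]: A3: from=<root@example.com>, size=480, nrcpt=1\n"
)

totals = count_senders(sample_log)
for address, n in totals.most_common():
    print(" %-32s %-15s" % (address, n))
```

most_common() lists the busiest senders first, which is often what you want from a mail log.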

Answer 1: (score: 1)

Have you considered using pandas? It can give you a nice table of results without needing regular expression commands.

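 import pandas as pd

 # email_list is assumed to hold the addresses already extracted from the log
 emails = pd.Series(email_list)
 individual_emails = emails.unique()

 # makes a table with one row per email and a zeroed tally
 tally = pd.DataFrame({'email': individual_emails,
                       'count': [0] * len(individual_emails)},
                      columns=['email', 'count'])

 for item in tally.index:
     address = tally.loc[item, 'email']
     total = len(emails[emails == address])
     tally.loc[item, 'count'] = total

 # emails.value_counts() produces the same tally in a single call
 print(tally)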

Answer 2: (score: 1)

I hope the code at the bottom helps.

In general, though, there are three things to keep in mind:

  1. Use a with statement when opening files
  2. When iterating over a dictionary, use iteritems() (items() in Python 3)
  3. When working with containers, collections is your best friend

    #!/usr/bin/python
    import re
    from collections import Counter

    fmt = " %-32s %-15s"
    filename = 'kkmail'

    # Extract the email addresses
    email_list = []
    with open(filename, 'r') as log:
        for line in log:
            _re = re.search(r'.*from=<(.*)>,\ssize', line)
            if _re:
                name = _re.group(1)
                email_list.append(name)

    # Count the email addresses
    counts = dict(Counter(email_list))  # List to dict of counts: {'a':3, 'b':7,...}
    for key, val in counts.iteritems():  # Python 2; use counts.items() on Python 3
        print(fmt % (key, val))
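For readers on Python 3, the same three points could be sketched roughly as follows. So that it runs as written, the sketch writes a few made-up log lines (with example.com addresses) to a temporary file; in practice filename would simply point at the real log, such as 'kkmail'.

```python
import os
import re
import tempfile
from collections import Counter

fmt = " %-32s %-15s"

# Hypothetical sample data, written to a temporary file for the demonstration.
sample_lines = [
    "Dec 23 10:00:01 host postfix/qmgr[101]: A1: from=<root@example.com>, size=512, nrcpt=1",
    "Dec 23 10:00:02 host postfix/qmgr[101]: A2: from=<user01@example.com>, size=640, nrcpt=1",
    "Dec 23 10:00:03 host postfix/qmgr[101]: A3: from=<root@example.com>, size=480, nrcpt=1",
]
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    filename = tmp.name

# Extract the email addresses
email_list = []
with open(filename) as log:        # point 1: the file is closed automatically
    for line in log:               # iterate line by line instead of readlines()
        match = re.search(r'from=<([^>]*)>,\ssize', line)
        if match:
            email_list.append(match.group(1))
os.remove(filename)                # clean up the temporary demo file

# Count the email addresses
counts = Counter(email_list)       # point 3: collections does the counting
for key, val in counts.items():    # point 2: items() replaces iteritems()
    print(fmt % (key, val))
```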