Parsing email headers

Time: 2018-10-14 01:28:03

Tags: python regex

I have a short piece of text from a file that represents an email.

s="""Joe Hillings@ENRON
09/08/99 02:52 PM
To: Joe Hillings/Corp/Enron@Enron
cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H 
Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok 
Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John 
Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES, 
Jeffrey Sherrick/Corp/Enron@Enron 
Subject: Re: India And The WTO Services Negotiation  
"""

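I want to extract each header of the email ("From", "To", "cc", "Subject"), including the time shown above.

As a test, I tried to extract just the To and cc fields from the string above.

I did it as follows: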

import re

regex=r"To:(?P<To>.*)\ncc:(?P<cc>.*)"

res=re.search(regex,s,re.M)

print("To: {}".format(res.group("To")))
print("cc: {}".format(res.group("cc")))

Output:

To:  Joe Hillings/Corp/Enron@Enron
cc:  Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H 

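This seems to work, but it only captures the first line of a header and ignores the rest of that header's data on the following lines. For "cc", for example, only the first line is captured.

If I now add the "Subject" header to the regular expression, it raises an error: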

regex1=r"To:(?P<To>.*)\ncc:(?P<cc>.*)\nSubject:(?P<Subject>.*)"

res=re.search(regex1,s,re.M)
print("To: {}".format(res.group("To")))
print("cc: {}".format(res.group("cc")))

Output:

AttributeError: 'NoneType' object has no attribute 'group'
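
The error comes from `re.search` returning `None`. Without `re.DOTALL`, `.` does not match a newline, so `(?P<cc>.*)` stops at the end of the first cc line, and `\nSubject:` then has to match right there, where the next line actually starts with `Thorn/...`. A minimal sketch of this, using a shortened hypothetical string (`sample`) rather than the full text above:

```python
import re

# Shortened, hypothetical sample whose cc field wraps onto a second line.
sample = "To: a@x\ncc: b@x, c\nd@x\nSubject: hi\n"

regex1 = r"To:(?P<To>.*)\ncc:(?P<cc>.*)\nSubject:(?P<Subject>.*)"
res = re.search(regex1, sample, re.M)

# `.` stops at the newline, so "\nSubject:" cannot follow the first cc line
# and the whole pattern fails to match.
if res is None:
    print("no match")
else:
    print(res.group("Subject"))
```

Checking the result for `None` before calling `.group()` avoids the `AttributeError`, although it does not by itself make the pattern match.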

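Any help in understanding where I went wrong, and why, would be greatly appreciated.

Thanks

Edit:

For a txt file containing multiple emails, the answer suggested below currently extracts only the headers of the last email in the file and ignores the earlier emails.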

s1="""Message-ID: <28937390.1075853126342.JavaMail.evans@thyme>
Date: Thu, 26 Jul 2001 06:54:59 -0700 (PDT)
From: michelle.cash@enron.com
To: rob.walls@enron.com
Subject: RE: Confidential Concern
Cc: sharon.butcher@enron.com, a..hope@enron.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Bcc: sharon.butcher@enron.com, a..hope@enron.com
X-From: Cash, Michelle </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MCASH>
X-To: Walls Jr., Rob </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Rwalls>
X-cc: Butcher, Sharon </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Sbutche>, Hope, Valeria A. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Vhope>
X-bcc: 
X-Folder: \MCASH (Non-Privileged)\Cash, Michelle\Sent Items
X-Origin: Cash-M
X-FileName: MCASH (Non-Privileged).pst

Sharon, I suggest that we ask Valeria Hope to investigate the fact situation here and report back to us jointly.  What do you think?  Michelle

 -----Original Message-----
From:Walls Jr., Rob  
Sent:Wednesday, July 25, 2001 5:53 PM
To: Cash, Michelle
Cc: Butcher, Sharon
Subject:FW: Confidential Concern

Michelle -

Since this is in Venezuela and thus part of wholesale, I am sending you a copy of this letter for you to review.  I'm not sure who should take the lead between you and Sharon but I'll leave that to you guys to work out.  Please let me know who is taking the lead on this.  Thanks.

 -----Original Message-----
From:   Sera, Sherri   On Behalf Of Office of the Chairman,
Sent:   Wednesday, July 25, 2001 10:54 AM
To: Fleming, Rosalee; Clark, Mary
Cc: Butcher, Sharon; Walls Jr., Rob; Kean, Steven J.
Subject:    Confidential Concern

I'm not sure I understand what has happened to this guy, but it's something that should be handled post haste.  Thanks, SRS
---------------------- Forwarded by Sherri Sera/Corp/Enron on 07/25/2001 10:52 AM ---------------------------

 << OLE Object: Picture (Device Independent Bitmap) >> 
Anonymous

From:   Anonymous on 07/23/2001 02:08 PM
To: 
cc:  

Subject:    Confidential Concern


 << File: Ken Lay - Jeff Skilling.doc >> 

"""

Output (for the "Edit" section):

Time:  07/23/2001 02:08 PM
To:     
cc:      


    Confidential Concern

1 answer:

Answer 0: (score: 0)

You can use the following:

import re

s="""Joe Hillings@ENRON
09/08/99 02:52 PM
To: Joe Hillings/Corp/Enron@Enron
cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H 
Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok 
Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John 
Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES, 
Jeffrey Sherrick/Corp/Enron@Enron 
Subject: Re: India And The WTO Services Negotiation  
"""
#regex=r"To:(?P<To>.*)\ncc:(?P<cc>.*)"
regex=r"(.|\n)*?(?P<xtime>[\d\/: ]+(AM|PM))\nTo:(?P<To>.*)\ncc:(?P<cc>(.|\n)*?)Subject:(?P<Subject>.*)"

res=re.search(regex,s,re.M)

print("time: {}".format(res.group("xtime")))
print("To: {}".format(res.group("To")))
print("cc: {}".format(res.group("cc")))
print("Subject: {}".format(res.group("Subject")))

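To match across multiple lines you can use (.|\n)*, because . on its own does not match a newline. The trailing ? makes the quantifier non-greedy, so the cc group stops at the first "Subject:" instead of the last one. Setting the re.DOTALL flag, which lets . match newlines, is an equivalent alternative.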
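
As an alternative sketch (not the approach used in the code above), `re.DOTALL` lets `.` match newlines, so `(.|\n)` is not needed, and `re.finditer` yields every matching header block instead of a single one. The names `HEADER_RE` and `parse_headers`, the `re.IGNORECASE` flag (so that `Cc:` matches as well as `cc:`), and the two-message sample are assumptions made for illustration. Header blocks with a different layout, such as one with no time line directly before `To:`, will still not match.

```python
import re

# Time line, then To:, cc: and Subject: in that order.
# re.DOTALL lets the lazy `.*?` groups span wrapped lines.
HEADER_RE = re.compile(
    r"(?P<xtime>\d[\d/: ]*(?:AM|PM))[ \t]*\n"
    r"To:(?P<To>.*?)\n"
    r"cc:(?P<cc>.*?)\n"
    r"Subject:(?P<Subject>[^\n]*)",
    re.DOTALL | re.IGNORECASE,
)


def parse_headers(text):
    """Return one dict per header block found in text (hypothetical helper)."""
    results = []
    for m in HEADER_RE.finditer(text):
        # Collapse wrapped lines and runs of spaces into single spaces.
        results.append({k: " ".join(v.split()) for k, v in m.groupdict().items()})
    return results


# Hypothetical sample: two messages in the same layout as the string above.
sample = """Joe Hillings@ENRON
09/08/99 02:52 PM
To: Joe Hillings/Corp/Enron@Enron
cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H
Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT
Subject: Re: India And The WTO Services Negotiation

Jane Doe@EXAMPLE
10/01/99 09:15 AM
To: someone@example.com
Cc: other@example.com
Subject: Second message
"""

headers = parse_headers(sample)
for h in headers:
    print(h["xtime"], "|", h["To"], "|", h["cc"], "|", h["Subject"])
```

For files in standard mail format (like the `Message-ID:` example in the question), Python's built-in `email` package may be a better fit than a hand-written regular expression.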