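Regex - extracting email header names in Python

Time: 2016-06-13 12:09:31

Tags: python regex

I am looking for a way to extract the header names (shown in bold) from this block of text (originally from an mbox file). I tried this regex, which works in Sublime Text's regex search but does not work in Python: ^\w+-?(\w+)?-?(\w+)?: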

rgex = re.findall('^\w+-?(\w+)?-?(\w+)?:', mail);

This is what mail contains:

  

X-Apparently-To: test@yahoo.com; Thu, 09 Jun 2016 13:41:21 +0000
Return-Path:
Received-SPF: pass (domain of yahoo.com designates 72.30.235.45 as permitted sender)
Received: from 127.0.0.1 (EHLO n3-vm9.bullet.mail.bf1.yahoo.com) (72.30.235.45) by mta1287.mail.ne1.yahoo.com with SMTPS; Thu, 09 Jun
2016 13:41:21 +0000
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo-inc.com; s=yibm; t=1465479679;
To: test@yahoo.com
From: "Yahoo"
Reply-To: "Yahoo"
X-YMailISG :PCypxycWLDvGv4Bg8ShrtzVYi3vpFMAjYaqWyWybcVJ_ZQff eyquyqb..Qu6UKhX_Tyz5b3da2iDtRStJpVnNulZHOb8GznJQTCKk9sjvboS   KsbzY4E1uScWz0Ieo0jjG0YHrB1dTCzOSeMiPNumCCFS1sR3_SkyMBGG_D2D   wWtdRducxLa2YgEMMubVpMtNJMBv.bwk0.E.jQNEy8I3LnJEqcDpmIUM7bZL   XgkEFz7yl1Zo6Sj4r0z6pGlVIFOql7uG9Bwq2VJoK1Q1upKJUOBfQqzf64y2   9fXLnQsWENpZloxwncGzLhdzEYGgE3xNuFV8QFxZGXyvtKZFoykH49M03URN   jtx8Yg6ypjyRbBIRVJGVFbjAvW6io3yeyIFh042jlgYQtLxbneFA60hn9ifT   Mit3bQ5l7Tginw0OgRM2cbqLo0tEZFt9vlN597Z3vPGwsVdBcTp9wnk6orj2   TqjEpAmODy3Yru2HzDP7Dbwq9CGaIozUm91VNWqw5Dy7AMQEsuvnBop7Fflk   G21m1WKMBgrS.2bOLQ4797E09LjlyyoWI9FouUNNhDljnPPf2AeKUKzauctw   ULOQPveWAm4lDsNLMp5yvXDYNIe5HMor84SVd8_xF3Icna1PAftXGzJUHrXK   NZSEN_VO0GprGfaNQg4uSW_0wXFXwC6TYQ4CMjz53o0qNGpILogVfRLwFCFL   DtW8nimkLLsNzmDajzJsR_juA86Orw2NE5ED4qdpPxmyxyrXYOQPu3O6zeYf   7mBzU0aX7VHJUxJ4L3HdB9qTjbTaCdnySrnjGtd7u9Cn9yRJirDNeg3UA82P   PeA1ZDfc0vKdrn5QI6e6YKa2TTt7Dspy3jObgSapH5epc3LyQVyN7yjpxrq_   MXAbpqedjUfcwq3c7lpt8xxUxy.MXWg0fJO059xijvb_sYTaQTGUWAMeVU.6   IW.hSksejwpn._CgE9Kqabbk5qgYIdYRW1pmz5OBYh0skCX1TrFRuxbGvDit   R_wr.wbTpJGiSST.b0ZetmgN72bVvlRtmNPw1Dk.zxaacXxhGSMWupPUDLJZ   OMrap2ax8oiQrxT3jIhk8seIkaNJ.tGUhlPx6G4lJJaz0g89LmjBaEjGUG8P   W3Phh9db3hjxUIX5UC0jg5ai2XZ7u_wXn2Muk61N1eRCZ0oA2S25YDPK1dh。   3VQ6pH8SSBxVkQHUJXbZUNqLAzi5V5wRS7oeitXERGgA2DiZB268.rJxS7di   OMT5eGoITG4LnAo1M3nsVQ6xceHDd4v6KD9KfBgTHX_iLUv_skCv4dVUgVvj   edKOFiOMHBTpJ9J9BECjTTzEUpc.fCNUcRwSsiSkqbRhUsAdCbxQZir3Nb1Z   6FzI6J2eNqpj4azjmDeI15R8MyN7VFc6bl6pCZySk2Tx5SQESDm.sVkADSVR   pI2nuscEjU3xo_qGUxbh5mbAA17K2zYpcFXaOce8_9Eszos5pURCcdtBYUqI   I_DOtvNe.zWY1ShRcr9ZzTj3ibmc7NBmvumhVMjqirb12mfJ6oxHv8d86gze   HtAJmJghczUg5otSzdxSgEJJxjMZrzSidJ9FP.gPiPWtuukz82YpZ32MnCVs    6.V2DRxpUmZa31KH93QSEzwMlCn3FFTLBv9izcjoFP81yeAn.3QloF8XIC3K WmtXtloyeGjuygAhlkd_prXmMGGC5JmPlY8xu4k1NavkdDh6pG6zIkt83Wsd p.D.0BgM   
X-Originating-IP: [75.30.245.45]
Authentication-Results: mta1287.mail.ne1.yahoo.com from=yahoo-inc.com; domainkeys=neutral (no sig); from=yahoo-inc.com; dkim=pass (ok)

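4 answers:

Answer 0 (score: 1)

A simpler approach than designing a suitable regex may be to use a more appropriate tool that ships with Python: the email.parser module, which parses RFC 822-style messages like this one.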

>>> from email import parser
>>> txt = """X-Apparently-To: test@yahoo.com; Thu, 09 Jun 2016 13:41:21 +0000 
... Return-Path: 
... Received-SPF: pass (domain of yahoo.com designates 72.30.235.45 as permitted sender) 
... Received: from 127.0.0.1 (EHLO n3-vm9.bullet.mail.bf1.yahoo.com) (72.30.235.45) by mta1287.mail.ne1.yahoo.com with SMTPS; Thu, 09 Jun 2016 13:41:21 +0000
... DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo-inc.com; s=yibm; t=1465479679; 
... To: test@yahoo.com 
... From: "Yahoo" 
... Reply-To: "Yahoo"
... X-YMailISG: PCypxy...
... X-Originating-IP: [75.30.245.45] 
... Authentication-Results: mta1287.mail.ne1.yahoo.com from=yahoo-inc.com; domainkeys=neutral (no sig); from=yahoo-inc.com; dkim=pass (ok)
... """
>>> msg = parser.Parser().parsestr(txt, headersonly=True)
>>> print(msg.keys())
['X-Apparently-To', 'Return-Path', 'Received-SPF', 'Received', 'DKIM-Signature', 'To', 'From', 'Reply-To', 'X-YMailISG', 'X-Originating-IP', 'Authentication-Results']
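Since the text comes from an mbox file, the standard-library mailbox module may save the step of cutting the file into messages by hand. The following is only a sketch: the helper name header_names_per_message and the one-message sample file are made up for illustration, and a real mbox file would be opened by its own path.

```python
import mailbox
import os
import tempfile

def header_names_per_message(mbox_path):
    """Return one list of header names for each message in an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [msg.keys() for msg in box]
    finally:
        box.close()

# Tiny hypothetical mbox with a single message, written to a temporary file
sample = (
    "From sender@example.com Thu Jun  9 13:41:21 2016\n"
    "To: test@yahoo.com\n"
    "Reply-To: \"Yahoo\"\n"
    "\n"
    "Body text, Note: this is not a header\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".mbox", delete=False) as fh:
    fh.write(sample)
    path = fh.name

names = header_names_per_message(path)
os.remove(path)
print(names)
```

The "From " separator line belongs to the mbox format, not to the message headers, so it does not show up among the names.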

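Answer 1 (score: 0)

If you run the regex over the whole mbox file, a regex alone will not work - you will have to write a program. The reason is that a message body may contain text that looks exactly like a header token.

Assuming you run the regex only on the header section of the mbox file, and going by the email RFC (section 2.2), the following regex should work:

  

'^([^:]+):'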
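The caveat about message bodies can be illustrated with a small sketch. It assumes that the header section ends at the first blank line and that the regex is applied with re.M so that ^ matches at every line start; the sample message and the helper name header_section are invented for the example.

```python
import re

HEADER_NAME = re.compile(r'^([^:]+):', re.MULTILINE)

# Made-up message: the body contains a line that looks like a header
message = (
    "To: test@yahoo.com\n"
    "From: \"Yahoo\"\n"
    "\n"
    "Note: this line is body text, not a header\n"
)

def header_section(raw):
    """Return the text before the first blank line (the header block)."""
    return raw.split("\n\n", 1)[0]

all_names = HEADER_NAME.findall(message)                   # also picks up the body line
header_names = HEADER_NAME.findall(header_section(message))  # header block only
print(all_names)
print(header_names)
```

Because [^:] also matches line breaks, this pattern can swallow a preceding line that contains no colon, so it is safest on a clean header block without folded continuation lines.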

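Answer 2 (score: 0)

Your '^\w+-?(\w+)?-?(\w+)?:' regex matches the start of the string (^), then 1+ word characters, followed by an optional -, then captures 1+ word characters into an optional Group 1 (which re.findall returns as the first item of each tuple in the resulting list), then an optional hyphen, and again a capturing group matching 1+ word characters (optional, but still returned as the 2nd item of each tuple), and finally a :. It does not work because, without re.M, ^ matches only at the very start of the string, and because the 2 optional capturing groups make re.findall return tuples of group contents instead of the full header names.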
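To see this behaviour concretely, here is a short sketch that runs the original pattern on a shortened, made-up version of the headers, with and without re.M:

```python
import re

# Shortened sample; the values are placeholders
mail = (
    "X-Apparently-To: test@yahoo.com\n"
    "Return-Path: \n"
    "To: test@yahoo.com\n"
)
pattern = r'^\w+-?(\w+)?-?(\w+)?:'

without_flag = re.findall(pattern, mail)        # ^ matches only at the start of the string
with_flag = re.findall(pattern, mail, re.M)     # ^ matches at the start of every line
print(without_flag)  # [('Apparently', 'To')]
print(with_flag)     # [('Apparently', 'To'), ('Path', ''), ('', '')]
```

Even with re.M, each result is a tuple of the two group captures, and the first word of the header name is never captured at all.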

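If the input you get follows the RFC 822 message style, you should consider switching to the from email import parser solution given above.

Alternatively, as I see it, you can capture all characters other than whitespace and colons, at the start of a line, up to a colon that is followed by whitespace: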

r"^([^\s:]+):\s"

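and use it with re.findall and the re.M flag.

Regex explanation

  • ^ - start of a line (re.M makes ^ match at the position after a newline, not only at the start of the string)
  • ([^\s:]+) - capturing group 1, matching 1+ characters other than whitespace and colons
  • : - a colon
  • \s - a whitespace character.

See the regex demo.

Python demo, where re.findall returns only the captured text: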

import re
p = re.compile(r'^([^\s:]+):\s', re.MULTILINE)
test_str = "X-Apparently-To: test@yahoo.com; Thu, 09 Jun 2016 13:41:21 +0000 \nReturn-Path: \nReceived-SPF: pass (domain of yahoo.com designates 72.30.235.45 as permitted sender) \nReceived: from 127.0.0.1 (EHLO n3-vm9.bullet.mail.bf1.yahoo.com) (72.30.235.45) by mta1287.mail.ne1.yahoo.com with SMTPS; Thu, 09 Jun \n2016 13:41:21 +0000 \nDKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo-inc.com; s=yibm; t=1465479679; \nTo: test@yahoo.com \nFrom: \"Yahoo\" \nReply-To: \"Yahoo\" \nX-YMailISG: PCypxycWLDvGv4Bg8ShrtzVYi3vpFMAjYaqWyWybcVJ_ZQff eyquyqb..Qu6UKhX_Tyz5b3da2iDtRStJpVnNulZHOb8GznJQTCKk9sjvboS KsbzY4E1uScWz0Ieo0jjG0YHrB1dTCzOSeMiPNumCCFS1sR3_SkyMBGG_D2D wWtdRducxLa2YgEMMubVpMtNJMBv.bwk0.E.jQNEy8I3LnJEqcDpmIUM7bZL XgkEFz7yl1Zo6Sj4r0z6pGlVIFOql7uG9Bwq2VJoK1Q1upKJUOBfQqzf64y2 9fXLnQsWENpZloxwncGzLhdzEYGgE3xNuFV8QFxZGXyvtKZFoykH49M03URN jtx8Yg6ypjyRbBIRVJGVFbjAvW6io3yeyIFh042jlgYQtLxbneFA60hn9ifT Mit3bQ5l7Tginw0OgRM2cbqLo0tEZFt9vlN597Z3vPGwsVdBcTp9wnk6orj2 TqjEpAmODy3Yru2HzDP7Dbwq9CGaIozUm91VNWqw5Dy7AMQEsuvnBop7Fflk G21m1WKMBgrS.2bOLQ4797E09LjlyyoWI9FouUNNhDljnPPf2AeKUKzauctw ULOQPveWAm4lDsNLMp5yvXDYNIe5HMor84SVd8_xF3Icna1PAftXGzJUHrXK NZSEN_VO0GprGfaNQg4uSW_0wXFXwC6TYQ4CMjz53o0qNGpILogVfRLwFCFL DtW8nimkLLsNzmDajzJsR_juA86Orw2NE5ED4qdpPxmyxyrXYOQPu3O6zeYf 7mBzU0aX7VHJUxJ4L3HdB9qTjbTaCdnySrnjGtd7u9Cn9yRJirDNeg3UA82P PeA1ZDfc0vKdrn5QI6e6YKa2TTt7Dspy3jObgSapH5epc3LyQVyN7yjpxrq_ MXAbpqedjUfcwq3c7lpt8xxUxy.MXWg0fJO059xijvb_sYTaQTGUWAMeVU.6 IW.hSksejwpn._CgE9Kqabbk5qgYIdYRW1pmz5OBYh0skCX1TrFRuxbGvDit R_wr.wbTpJGiSST.b0ZetmgN72bVvlRtmNPw1Dk.zxaacXxhGSMWupPUDLJZ OMrap2ax8oiQrxT3jIhk8seIkaNJ.tGUhlPx6G4lJJaz0g89LmjBaEjGUG8P W3Phh9db3hjxUIX5UC0jg5ai2XZ7u_wXn2Muk61N1eRCZ0oA2S25YDPK1dh. 3VQ6pH8SSBxVkQHUJXbZUNqLAzi5V5wRS7oeitXERGgA2DiZB268.rJxS7di OMT5eGoITG4LnAo1M3nsVQ6xceHDd4v6KD9KfBgTHX_iLUv_skCv4dVUgVvj edKOFiOMHBTpJ9J9BECjTTzEUpc.fCNUcRwSsiSkqbRhUsAdCbxQZir3Nb1Z 6FzI6J2eNqpj4azjmDeI15R8MyN7VFc6bl6pCZySk2Tx5SQESDm.sVkADSVR pI2nuscEjU3xo_qGUxbh5mbAA17K2zYpcFXaOce8_9Eszos5pURCcdtBYUqI I_DOtvNe.zWY1ShRcr9ZzTj3ibmc7NBmvumhVMjqirb12mfJ6oxHv8d86gze HtAJmJghczUg5otSzdxSgEJJxjMZrzSidJ9FP.gPiPWtuukz82YpZ32MnCVs 6.V2DRxpUmZa31KH93QSEzwMlCn3FFTLBv9izcjoFP81yeAn.3QloF8XIC3K WmtXtloyeGjuygAhlkd_prXmMGGC5JmPlY8xu4k1NavkdDh6pG6zIkt83Wsd p.D.0BgM \nX-Originating-IP: [75.30.245.45] \nAuthentication-Results: mta1287.mail.ne1.yahoo.com from=yahoo-inc.com; domainkeys=neutral (no sig); from=yahoo-inc.com; dkim=pass (ok)"
print(p.findall(test_str))

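Update

Now, since you asked to get only the values, you can use the same approach, but strip the key whenever one is found and add the value to the result list: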

import re

txt = "YOUR_STRING_HERE"
values = []                    # Resulting value list
val = None                     # Value being built; None until the first key is seen
for line in txt.split("\n"):   # Split the input into lines
    # A key starts at the beginning of the line; its value may be empty
    m = re.match(r"[^\s:]+:(?:\s+|$)", line)
    if m:
        if val is not None:    # Save the value of the previous key
            values.append(val)
        val = line[m.end():].strip()    # Start the new value without the key
    elif val is not None and line.strip():
        val += " " + line.strip()       # Continuation line: append to the current value
if val is not None:            # Save the last value as well
    values.append(val)
print(values)

See the IDEONE demo.
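For comparison, the email package mentioned earlier can return the same key/value pairs without a hand-written loop. This is a sketch on a shortened, made-up header block; collapsing whitespace with " ".join(v.split()) is my own choice for joining folded lines, not something the package requires.

```python
from email import message_from_string

# Shortened sample with one folded (continued) header
txt = (
    "Received: from 127.0.0.1 with SMTPS;\n"
    " Thu, 09 Jun 2016 13:41:21 +0000\n"
    "To: test@yahoo.com\n"
    "Reply-To: \"Yahoo\"\n"
)
msg = message_from_string(txt)

names = msg.keys()                                   # header names in original order
values = [" ".join(v.split()) for v in msg.values()] # values with folds collapsed
print(names)
print(values)
```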

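Answer 3 (score: 0)

Python provides the email package, which can do these low-level tasks for you, but if you want to learn about email headers, the reference is RFC 5322 (formerly RFC 822).

Among other useful information, you will find the definition of header fields there: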

  

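Header fields are lines beginning with a field name, followed by a colon (":"), followed by a field body, and terminated by CRLF. A field name MUST be composed of printable US-ASCII characters (i.e., characters that have values between 33 and 126, inclusive), except colon. A field body may be composed of printable US-ASCII characters as well as the space (SP, ASCII value 32) and horizontal tab (HTAB, ASCII value 9) characters (together known as the white space characters, WSP). A field body MUST NOT include CR and LF except when used in "folding" and "unfolding"

Folding is defined later as:

  

The field body portion of a header field can be split into a multiple-line representation; this is called "folding". The general rule is that wherever this specification allows for folding white space (not simply WSP characters), a CRLF may be inserted before any WSP.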

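This means:

  • If a line does not start with WSP (\s in regex), then everything from the start of the line up to the colon is the header name.
  • When a line starts with WSP, it is a continuation line.

So this regex, anchored to the start of each line (^ together with the re.M flag), should be enough: '^([\x21-\x7e]+?):'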
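The two rules above can be sketched line by line as follows. The function name header_names and the sample block are made up for illustration, and the character class is a slightly stricter variant that leaves the colon (\x3a) out of the printable range, as the quoted definition requires.

```python
import re

# Printable US-ASCII (33..126) except the colon (58), per the quoted definition
FIELD_NAME = re.compile(r'([\x21-\x39\x3b-\x7e]+):')

def header_names(header_block):
    """Collect field names, skipping continuation lines, up to the first blank line."""
    names = []
    for line in header_block.splitlines():
        if not line:              # blank line: end of the header section
            break
        if line[0] in " \t":      # starts with WSP: continuation line
            continue
        m = FIELD_NAME.match(line)
        if m:
            names.append(m.group(1))
    return names

sample = (
    "Received: from 127.0.0.1 by mta1287.mail.ne1.yahoo.com with SMTPS;\n"
    " Thu, 09 Jun 2016 13:41:21 +0000\n"
    "DKIM-Signature: v=1; a=rsa-sha256;\n"
    "To: test@yahoo.com\n"
    "\n"
    "Body: not a header\n"
)
result = header_names(sample)
print(result)  # ['Received', 'DKIM-Signature', 'To']
```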