Python: extracting names *and* emails from a message body with a regex in one pass

Time: 2014-10-26 02:50:57

Tags: python regex email

Python3

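I need help creating a regular expression that extracts names and email addresses from the body of a forwarded email, which looks similar to this (the real addresses have been replaced with dummy ones):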

> Begin forwarded message:
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa@aa-aaa.com>
> To: maria.brown@aaa.com, George Washington <george@washington.com>, =
thomas.jefferson@aaa.com, thomas.alva.edison@aaa.com, Juan =
<juan@aaa.com>, Alan <alan@aaa.com>, Alec <alec@aaa.com>, =
Alejandro <aaa@aaa.com>, Alex <aaa@planeas.com>, Andrea =
<andrea.mery@thomsen.cl>, Andrea <andrea.22@aaa.com>, Andres =
<andres@aaa.com>, Andres <avaldivieso@aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye

My first step was to extract all the email addresses into a list with a custom function, to which I pass the whole email body, like so:

import re

def extract_emails(block_of_text):
    # Loose pattern: letters, digits, dots and hyphens on both sides of "@"
    t = r'\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'
    return re.findall(t, block_of_text)

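A few days ago I asked a question about extracting names using regex, to help me build a function that extracts all the names. My idea was to join the two results later. I accepted an answer that did what I asked for, and came up with this second function: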

def extract_names(block_of_text):
    # Capture the words between ": " or ", " and the following " <"
    p = r'[:,] ([\w ]+) \<'
    return re.findall(p, block_of_text)
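Run on their own, the two helpers return independent lists, so nothing ties a name to its address. The sketch below repeats both functions and applies them to a shortened two-line header (the `sample` string is made up for illustration, not the full message):

```python
import re

def extract_emails(block_of_text):
    t = r'\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b'
    return re.findall(t, block_of_text)

def extract_names(block_of_text):
    p = r'[:,] ([\w ]+) \<'
    return re.findall(p, block_of_text)

# Shortened header, for illustration only.
sample = (
    "From: Charlie Brown <aaa@aa-aaa.com>\n"
    "To: maria.brown@aaa.com, George Washington <george@washington.com>"
)

emails = extract_emails(sample)
names = extract_names(sample)

print(len(emails), emails)  # three addresses
print(len(names), names)    # only two names: maria.brown@aaa.com has none
```

Because the bare address contributes an email but no name, `zip(names, emails)` would pair "George Washington" with the wrong address.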

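My problem now is matching the extracted names to the extracted emails, mainly because there are sometimes fewer names than emails. So I thought I would be better off building a single regex that extracts both the name and the email.

This is my failed attempt at building such a regex: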

[:,]([\w \<]+)([\w.-]+@[\w.-]+\.[\w.-]+)

REGEX101 LINK
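To see why the attempt above misbehaves, it may help to run it on a single header line. The following is only a small diagnostic sketch; the `line` string is copied from the sample message and the variable names are illustrative:

```python
import re

attempt = r'[:,]([\w \<]+)([\w.-]+@[\w.-]+\.[\w.-]+)'
line = "From: Charlie Brown <aaa@aa-aaa.com>"

result = re.findall(attempt, line)
print(result)
# [(' Charlie Brown <aa', 'a@aa-aaa.com')]
```

The first group's character class accepts spaces, `<` and word characters, and it is greedy, so it swallows the leading space, the angle bracket, and all but one character of the address's local part. The second group is left with only the minimum it needs to match.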

Can anyone help and suggest a good, clean regex that captures both the name and the email into a list of tuples or a dictionary? Thanks

EDIT: The expected output of the regex in Python would be a list like this:

 [('Charlie Brown', 'aaa@aa-aaa.com'), ('', 'maria.brown@aaa.com'), ('George Washington', 'george@washington.com'), ('', 'thomas.jefferson@aaa.com'), ('', 'thomas.alva.edison@aaa.com'), ('Juan', 'juan@aaa.com'), ('Alan', 'alan@aaa.com'), ('Alec', 'alec@aaa.com'), ('Alejandro', 'aaa@aaa.com'), ('Alex', 'aaa@planeas.com'), ('Andrea', 'andrea.mery@thomsen.cl'), ('Andrea', 'andrea.22@aaa.com'), ('Andres', 'andres@aaa.com'), ('Andres', 'avaldivieso@aaa.com')]

1 Answer:

Answer 0: (score: 1)

It seems you want something like this.

[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)

DEMO
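For readers who find the one-line pattern hard to parse, here is the same expression spread out with `re.VERBOSE` and commented piece by piece. This is a reading aid under my own interpretation of each part, not a different pattern; the name `PAIR_RE` and the shortened `header` string are illustrative:

```python
import re

PAIR_RE = re.compile(r"""
    [:,]                        # a header colon or a list comma starts each entry
    \s*=?\s*                    # optional soft line break ("=" at end of line)
    (?:
        (                       # group 1: optional display name
            [A-Z][a-z]+             # one capitalised word
            (?:\s[A-Z][a-z]+)?      # optionally a second capitalised word
        )
    )?
    \s*=?\s*                    # another possible soft line break before the bracket
    .*?                         # lazily skip the rest, e.g. the opening bracket
    ([\w.]+@[\w.-]+)            # group 2: the address
""", re.VERBOSE)

# Shortened header, for illustration only.
header = (
    "From: Charlie Brown <aaa@aa-aaa.com>\n"
    "To: maria.brown@aaa.com, Juan =\n"
    "<juan@aaa.com>"
)

pairs = PAIR_RE.findall(header)
print(pairs)
# [('Charlie Brown', 'aaa@aa-aaa.com'), ('', 'maria.brown@aaa.com'), ('Juan', 'juan@aaa.com')]
```

Note that the name group only covers one or two capitalised ASCII words. A display name outside that shape (for example one with a middle initial) would likely be truncated or dropped, so the pattern fits this sample better than arbitrary mail headers.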

>>> import re
>>> s = """ > Begin forwarded message:
>=20
> Date: December 20, 2013 at 11:32:39 AM GMT-3
> Subject: My dummy subject
> From: Charlie Brown <aaa@aa-aaa.com>
> To: maria.brown@aaa.com, George Washington <george@washington.com>, =
thomas.jefferson@aaa.com, thomas.alva.edison@aaa.com, Juan =
<juan@aaa.com>, Alan <alan@aaa.com>, Alec <alec@aaa.com>, =
Alejandro <aaa@aaa.com>, Alex <aaa@planeas.com>, Andrea =
<andrea.mery@thomsen.cl>, Andrea <andrea.22@aaa.com>, Andres =
<andres@aaa.com>, Andres <avaldivieso@aaa.com>
> Hi,
> Please reply ASAP with your RSVP
> Bye"""
>>> re.findall(r'[:,]\s*=?\s*(?:([A-Z][a-z]+(?:\s[A-Z][a-z]+)?))?\s*=?\s*.*?([\w.]+@[\w.-]+)', s)
[('Charlie Brown', 'aaa@aa-aaa.com'), ('', 'maria.brown@aaa.com'), ('George Washington', 'george@washington.com'), ('', 'thomas.jefferson@aaa.com'), ('', 'thomas.alva.edison@aaa.com'), ('Juan', 'juan@aaa.com'), ('Alan', 'alan@aaa.com'), ('Alec', 'alec@aaa.com'), ('Alejandro', 'aaa@aaa.com'), ('Alex', 'aaa@planeas.com'), ('Andrea', 'andrea.mery@thomsen.cl'), ('Andrea', 'andrea.22@aaa.com'), ('Andres', 'andres@aaa.com'), ('Andres', 'avaldivieso@aaa.com')]
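The question also mentions a dictionary as an acceptable result. One possible way to get there from the list of tuples is sketched below, assuming the address is used as the key because display names repeat ("Andrea" appears twice) or are empty. The `pairs` list is a hand-copied subset of the output above, and `by_email` and `unnamed` are illustrative names:

```python
# A subset of the (name, email) tuples returned by re.findall above.
pairs = [
    ('Charlie Brown', 'aaa@aa-aaa.com'),
    ('', 'maria.brown@aaa.com'),
    ('Andrea', 'andrea.mery@thomsen.cl'),
    ('Andrea', 'andrea.22@aaa.com'),
]

# Key by address: names are not unique, addresses are in this sample.
by_email = {email: name for name, email in pairs}

# Addresses that came without a display name.
unnamed = [email for name, email in pairs if not name]

print(by_email['andrea.22@aaa.com'])  # Andrea
print(unnamed)                        # ['maria.brown@aaa.com']
```

Keying by name instead would silently overwrite one of the two "Andrea" entries.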