Parsing an email and getting a number from the body

Time: 2011-11-29 21:25:47

Tags: python regex email

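I want to extract the first number found in the body of an email. With the help of the email library, I extract the body of the message as a string. The problem is that before the actual plain-text body starts, there is some information about the encoding, and it contains numbers. How can I reliably skip that information, without depending on the client that created the email, and get just the first number in the body?

If I do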

match = re.search('\d+', string, re.MULTILINE)

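it picks up the first match from the encoding information or other headers, not from the actual mail content.

OK, I have added a sample. This is what it looks like (I want to extract 123). But I assume a message sent from a different client may look different.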

--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--

Update: now I'm stuck on the iterators :-/ I really tried, but I don't understand them. This code:

msg = email.message_from_string(raw_message)
for part in email.iterators.typed_subpart_iterator(msg, 'text', 'plain'):
    print part

outputs:

--14dae93404410f62f404b2e65e10
Content-Type: text/plain; charset=ISO-8859-1

Junk 123 Junk

--14dae93404410f62f404b2e65e10
Content-Type: text/html; charset=ISO-8859-1

<p>Junk 123 Junk</p>

--14dae93404410f62f404b2e65e10--

Why doesn't it output:

Junk 123 Junk

2 answers:

Answer 0: (score: 6)

You probably want to use the iterators to skip over the subpart headers.

http://docs.python.org/library/email.iterators.html#module-email.iterators
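One possible reason the loop in the question prints everything: if `raw_message` holds only the fragment shown there, without a top-level `Content-Type: multipart/...; boundary=...` header, the parser has no boundary to split on. It then treats the whole string as a single text/plain body, and `typed_subpart_iterator` yields the message itself. The following Python 3 check is a small sketch under that assumption; `FRAGMENT` and `describe` are hypothetical names used only for illustration.

```python
import email
from email.iterators import typed_subpart_iterator

# Assumption: the raw string is only the fragment from the question, with no
# top-level "Content-Type: multipart/...; boundary=..." header in front of it.
FRAGMENT = (
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "Junk 123 Junk\n"
    "\n"
    "--14dae93404410f62f404b2e65e10--\n"
)


def describe(raw_message):
    """Return (is_multipart, content_type, number_of_text_plain_parts)."""
    msg = email.message_from_string(raw_message)
    parts = list(typed_subpart_iterator(msg, "text", "plain"))
    return msg.is_multipart(), msg.get_content_type(), len(parts)


print(describe(FRAGMENT))
```

If the first value is `False`, the message was not parsed as multipart, and the fix is to pass the complete raw message, including its top-level headers, to `email.message_from_string`.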

This example prints the body of every text/plain subpart of the message:

for part in email.iterators.typed_subpart_iterator(msg, 'text', 'plain'):
    for body_line in email.iterators.body_line_iterator(part):
        print body_line
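For Python 3, the same idea can be sketched as follows. This is a minimal sketch, not a definitive implementation: `RAW_MESSAGE` and `first_number_in_plain_text` are hypothetical names, and the top-level `multipart/alternative` header is an assumption added so that the parser can locate the boundary used in the question's sample. The function decodes each text/plain part and runs the regex only on that text, so digits in headers and boundary lines are never searched.

```python
import email
import re
from email.iterators import typed_subpart_iterator

# Hypothetical raw message: the sample from the question plus the top-level
# multipart header (an assumption) that tells the parser which boundary to use.
RAW_MESSAGE = (
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/alternative; "
    'boundary="14dae93404410f62f404b2e65e10"\n'
    "\n"
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "Junk 123 Junk\n"
    "\n"
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "\n"
    "<p>Junk 123 Junk</p>\n"
    "\n"
    "--14dae93404410f62f404b2e65e10--\n"
)


def first_number_in_plain_text(raw_message):
    """Return the first run of digits found in a text/plain part, or None."""
    msg = email.message_from_string(raw_message)
    for part in typed_subpart_iterator(msg, "text", "plain"):
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        text = payload.decode(charset, errors="replace")
        match = re.search(r"\d+", text)
        if match:
            return match.group(0)
    return None


print(first_number_in_plain_text(RAW_MESSAGE))
```

Because the search is limited to decoded text/plain payloads, this approach should not depend on how a particular client orders or formats its headers.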

Answer 1: (score: 0)

You could use this:

match = re.search(r"Content-Type:.*?[\n\r]+\D*(\d+)", subject)
if match:
    result = match.group(1)
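The same pattern can be written in verbose form and tried against the sample from the question. This is only a sketch: `PATTERN`, `first_number_after_content_type`, `SAMPLE`, and `WITH_EXTRA_HEADER` are hypothetical names, and the second input is an assumed variation (a client that adds a `Content-Transfer-Encoding` header) included to show where a header-based regex can go wrong.

```python
import re

# Verbose form of the pattern above; whitespace and comments are ignored.
PATTERN = re.compile(
    r"""
    Content-Type:   # literal header name
    .*?             # rest of the header line (lazy, stops at a line break)
    [\n\r]+         # one or more line breaks
    \D*             # any run of non-digit characters, may span lines
    (\d+)           # group 1: the first run of digits after that
    """,
    re.VERBOSE,
)


def first_number_after_content_type(text):
    """Return group 1 of the first match, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


SAMPLE = (
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "Junk 123 Junk\n"
)

# Assumed variation: another header follows Content-Type and contains a digit.
WITH_EXTRA_HEADER = (
    "--14dae93404410f62f404b2e65e10\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "Junk 123 Junk\n"
)

print(first_number_after_content_type(SAMPLE))
print(first_number_after_content_type(WITH_EXTRA_HEADER))
```

On the question's sample the pattern finds 123, but in the second input `\D*` runs across the extra header line and the capture is the 7 from `7bit`, so the result depends on which headers the sending client emits.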

Explanation

"
Content-Type:    # Match the characters “Content-Type:” literally
.                # Match any single character that is not a line break character
   *?               # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
[\n\r]           # Match a single character present in the list below
                    # A line feed character
                    # A carriage return character
   +                # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\D               # Match a single character that is not a digit 0..9
   *                # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(                # Match the regular expression below and capture its match into backreference number 1
   \d               # Match a single digit 0..9
      +                # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"