How to extract the email body without the signature or quoted text

Time: 2016-12-14 03:22:24

Tags: email machine-learning nlp nltk

What software can I use to process raw email text and remove signatures, quoted reply text, and so on?

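For example, here is an email. I want to get the "Thank you guys." text, or more text if there is more. I do not want the HTML signature (in the first red box) or the older email the person replied to (in the second red box).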

[Image: screenshot of the example email, with the signature and the quoted older email marked by red boxes]

1 answer:

Answer 0: (score: 0)

You can try the email module, Python's standard email message handling package:

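import email

# Read the raw message, headers included, from disk.
with open('test.txt', 'r') as myfile:
    data = myfile.read()

# Parse the raw text into a Message object.
msg = email.message_from_string(data)
if msg.is_multipart():
    # For a multipart message, get_payload() returns a list of sub-messages.
    for part in msg.get_payload():
        print(part.get_payload().strip())
else:
    # For a single-part message, get_payload() returns the body as a string.
    print(msg.get_payload().strip())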

Output:

this is the body text
this is the attachment text
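
Note that the output above includes the attachment text as well as the body. If only the body is wanted, one possible approach is to walk the MIME tree and skip any part whose Content-Disposition is attachment. The following is a minimal sketch, not part of the original answer: the function name extract_body_text is hypothetical, and it assumes Python 3.5 or later and parts that are not base64 or quoted-printable encoded (those would need get_payload(decode=True)).

```python
import email


def extract_body_text(raw_message):
    """Return the text/plain parts of a raw message, skipping attachments."""
    msg = email.message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        # Multipart containers only hold other parts; they have no text.
        if part.is_multipart():
            continue
        # Skip anything explicitly marked as an attachment.
        if part.get_content_disposition() == 'attachment':
            continue
        # Keep plain-text parts only (HTML alternatives are ignored here).
        if part.get_content_type() == 'text/plain':
            parts.append(part.get_payload().strip())
    return '\n'.join(parts)
```

Called on the contents of test.txt, this should return only the first part, since the second part is declared as an attachment.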

The test.txt file contains the following:

From: John Doe <example@example.com>
MIME-Version: 1.0
Content-Type: multipart/mixed;
        boundary="XXXXboundary text"

This is a multipart message in MIME format.

--XXXXboundary text 
Content-Type: text/plain

this is the body text

--XXXXboundary text 
Content-Type: text/plain;
Content-Disposition: attachment;
        filename="test.txt"

this is the attachment text

--XXXXboundary text--
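
Parsing the MIME structure separates the body from attachments, but it does not by itself remove a signature or a quoted reply that sits inside the body text, which is what the question asks about. The following is a rough heuristic sketch, not part of the original answer. The function name strip_quotes_and_signature and both regular expressions are assumptions for illustration: it relies on the conventional "-- " signature delimiter, ">" quote prefixes, and reply headers of the form "On ... wrote:" or "-----Original Message-----". Real mail clients vary widely, so these rules will miss many cases.

```python
import re

# Assumed patterns for the line that introduces a quoted older message.
REPLY_HEADER = re.compile(r'^On .+ wrote:\s*$')
ORIGINAL_MESSAGE = re.compile(r'^-+\s*Original Message\s*-+\s*$', re.IGNORECASE)


def strip_quotes_and_signature(body):
    """Heuristically drop the signature and quoted reply text from a plain-text body."""
    kept = []
    for line in body.splitlines():
        # Conventional signature delimiter: two dashes, optionally followed by a space.
        if line.rstrip() == '--':
            break
        # Everything after a reply header is treated as the older message.
        if REPLY_HEADER.match(line) or ORIGINAL_MESSAGE.match(line):
            break
        # Lines starting with '>' are quoted text interleaved with the reply.
        if line.lstrip().startswith('>'):
            continue
        kept.append(line)
    return '\n'.join(kept).strip()
```

For a body such as "Thank you guys." followed by a "-- " signature block and a quoted reply, this keeps only the first line. HTML signatures would first need to be converted to plain text or handled with an HTML parser, which this sketch does not attempt.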