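I am using Python to read the Enron email dataset. The emails are stored in text files. I want to read each text file and extract only the "body" of each email. I don't care about any of the other fields, such as FROM, TO, BCC, attachments, or DATE. I only want the BODY part, and I want to store the bodies in a list. I tried using the get_payload() function, but it still prints everything. How can I skip the rest and keep only the body?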
import email
from email.parser import Parser

# Code to extract a particular section from raw emails.
parser = Parser()
with open("path of the file", "r") as f:
    text1 = f.read()
msg = email.message_from_string(text1)
# Renamed from "email" so the result does not shadow the email module.
parsed = parser.parsestr(text1)
if msg.is_multipart():
    for payload in msg.get_payload():
        print(payload.get_payload())
else:
    print(msg.get_payload())
A single file may contain multiple emails. Sample emails:
docID: 1
segmentNumber: 0
Body: I just checked with Carolyn on your invoicing for the conference. She
verified the 85K was processed.
##########################################################
docID: 2
segmentNumber: 0
Body: null
##########################################################
docID: 3
segmentNumber: 0
Body: In regard to the costs for the GAM conference, Karen told me the $ 6,695.97
figure was inclusive of all the items for the conference. However, after
speaking with Shweta, I found out this is not the case. The CDs are not
included in this figure.
The CD cost will be $2,011.50 + the cost of postage/handling (which is
currently being tabulated).
##########################################################
docID: 3
segmentNumber: 1
Body:
This is the original quote for this project and it did not include the
postage. As soon as I have the details from the vendor, I'll forward those to
you.
Please call me if you have any questions.
Answer 0 (score: 0)
Assuming all of your files follow the format shown in the example, this may work:
email_body_list = [ email.split('Body:')[-1] for email in file_content.split('##########################################################')]
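The one-liner above leaves surrounding whitespace in each body and keeps "null" placeholders. A slightly longer sketch, under the same assumption that every record looks like the sample in the question (a "Body:" field, records separated by a line of "#" characters), might look like the following. The names extract_bodies, SEPARATOR, and skip_null are made up for illustration, and treating a body of "null" as empty is an assumption about the data. Note that it takes everything after the first "Body:" in a record, whereas the one-liner takes everything after the last.

```python
import re

# Assumption: records are separated by a line consisting only of '#' characters.
SEPARATOR = re.compile(r"^#{10,}[ \t]*$", re.MULTILINE)


def extract_bodies(file_content, skip_null=True):
    """Return the text after 'Body:' for each record in file_content."""
    bodies = []
    for record in SEPARATOR.split(file_content):
        # partition() splits on the first 'Body:' only, so a later
        # occurrence inside the message text is left untouched.
        _, marker, body = record.partition("Body:")
        if not marker:
            continue  # chunk without a Body: field, e.g. trailing blank text
        body = body.strip()
        if skip_null and body in ("", "null"):
            continue  # assumption: 'null' marks an email with no body
        bodies.append(body)
    return bodies


# Small stand-in for a file in the format shown in the question.
sample = "\n".join([
    "docID: 1", "segmentNumber: 0", "Body: I just checked with Carolyn.",
    "#" * 58,
    "docID: 2", "segmentNumber: 0", "Body: null",
    "#" * 58,
    "docID: 3", "segmentNumber: 1", "Body:",
    "This is the original quote", "for this project.",
])

email_body_list = extract_bodies(sample)
print(email_body_list)
```

For a real file, file_content would come from reading the file first, for example with open(path) as f: file_content = f.read(), where path is whatever location your data lives at.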