Question

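I'm using Python to read the Enron email dataset. The emails are stored in text files. I want to read a text file and extract only the "body" of each email. I don't care about the other fields (FROM, TO, BCC, attachments, DATE, and so on); I only want the BODY part, stored in a list. I tried the get_payload() function, but it still prints everything. How can I skip the rest and keep only the body?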

import email

# Code to extract a particular section from raw emails.

# Read the whole file into one string; the path is a placeholder.
with open("path of the file", "r") as f:
    text1 = f.read()
msg = email.message_from_string(text1)

# A multipart message holds a list of sub-messages; otherwise the payload is a string.
if msg.is_multipart():
    for payload in msg.get_payload():
        print(payload.get_payload())
else:
    print(msg.get_payload())

A single file may contain multiple emails. Sample emails:

docID:  1
segmentNumber:  0
Body:   I just checked with Carolyn on your invoicing for the conference.  She 
verified the 85K was processed.

##########################################################
docID:  2
segmentNumber:  0
Body:   null
##########################################################
docID:  3
segmentNumber:  0
Body:   In regard to the costs for the GAM conference, Karen told me the $ 6,695.97 
figure was inclusive of all the items for the conference.  However, after 
speaking with Shweta, I found out this is not the case.  The CDs are not 
included in this figure.  

The CD cost will be $2,011.50 + the cost of postage/handling (which is 
currently being tabulated).


##########################################################
docID:  3
segmentNumber:  1
Body:   
This is the original quote for this project and it did not include the 
postage. As soon as I have the details from the vendor, I'll forward those to 
you.
Please call me if you have any questions.

Answer 1

Assuming all your files have the format shown in the example, this might work:

# file_content is the full text of one file, e.g. the result of f.read()
email_body_list = [ record.split('Body:')[-1] for record in file_content.split('##########################################################')]
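
The one-liner above keeps surrounding whitespace, returns the literal text "null" for empty bodies, and truncates a body that itself contains the text "Body:". A slightly more defensive sketch is shown below. It is only an illustration under assumptions drawn from the sample: records are separated by a line made up solely of "#" characters, the body is everything after the first "Body:" in a record, and "null" marks an empty body. The names extract_bodies and SEPARATOR are made up for this example.

```python
import re

# Assumption: a separator is a line consisting only of '#' characters (10 or more).
SEPARATOR = re.compile(r"^#{10,}[ \t]*$", re.MULTILINE)


def extract_bodies(file_content):
    """Return a list with the text that follows 'Body:' in each record."""
    bodies = []
    for record in SEPARATOR.split(file_content):
        # partition() splits on the first 'Body:' only, so a body that
        # mentions 'Body:' again is kept whole.
        _, marker, body = record.partition("Body:")
        if not marker:
            continue  # no 'Body:' in this chunk (e.g. blank text after a separator)
        body = body.strip()
        # Assumption: the literal 'null' stands for an empty body in this dump.
        bodies.append("" if body == "null" else body)
    return bodies
```

To use it, read a file into a string (for example with f.read() inside a with open(...) block) and pass that string to extract_bodies(); the result is one list entry per record, in file order.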

Extracting only the email body with Python
