Get only the body of email .txt files, and the full contents of other text files, from a directory that contains emails as well as other files

Time: 2019-02-17 21:01:29

Tags: python email multipart payload summarization

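I have a directory that contains email .txt files as well as other text files. For the email files I want only the body, without the To and From fields and with the headers and footers removed; for the other text files I want the full contents. My question is how to use is_multipart and get_payload correctly in my code so that these operations are applied only to the email files.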

import os, sys, csv
import glob
import re
import email
#from tika import parser
import warnings
# Silence gensim's UserWarning before importing it
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
# Note: gensim.summarization was removed in gensim 4.0, so this needs gensim < 4.0
from gensim.summarization import summarize, keywords

# Set path to directory where files are
dirs = 'C:\\Users\\Lenovo\\.spyder-py3\\Testing\\'
#os.chdir(dirs)
for filename in glob.glob(os.path.join(dirs, '*.txt')):
    try:
        # Read each file once; the context manager closes it afterwards
        with open(filename, 'r', encoding='utf-8') as file:
            filecontents = file.read()
        # Parse before collapsing whitespace: the parser relies on line breaks
        # to separate the headers from the body
        b = email.message_from_string(filecontents)
        if b.is_multipart():
            for payload in b.get_payload():
                # if payload.is_multipart(): ...
                print(payload.get_payload())
        else:
            print(b.get_payload())
        # Collapse runs of whitespace only for the summarization step
        filecontents = re.sub(r'\s+', ' ', filecontents).strip()
        print(filecontents)
        summary = summarize(filecontents, ratio=0.10)
        print(summary)
        kw = keywords(filecontents, words=15)
        print(kw)
        #writer.writerow([file, summary, kw])
    except Exception as e:
        # Report the failure instead of silently skipping the file
        print(filename, e)
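
For reference, here is a minimal sketch of one way the email-only handling could be arranged. It is an illustration under stated assumptions, not a definitive solution. The helper names looks_like_email and extract_text are hypothetical. The test for "is this an email" is only a heuristic (a From header plus a To or Subject header). The sketch also assumes the text/plain parts are not base64 or quoted-printable encoded, because get_payload() without decode=True returns the payload as stored.

```python
import email


def looks_like_email(text):
    """Heuristic (an assumption, not a standard test): treat the text as an
    email only if it parses with a From header and a To or Subject header."""
    msg = email.message_from_string(text)
    has_sender = msg.get("From") is not None
    has_other = msg.get("To") is not None or msg.get("Subject") is not None
    return has_sender and has_other


def extract_text(text):
    """Return only the body for email-like text, or the text unchanged."""
    if not looks_like_email(text):
        # Not an email: keep the whole file contents
        return text
    msg = email.message_from_string(text)
    if not msg.is_multipart():
        # Single-part message: the payload is the body string
        return msg.get_payload()
    # Multipart message: walk() visits nested parts too; keep only text/plain
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts carry no text of their own
        if part.get_content_type() == "text/plain":
            parts.append(part.get_payload())
    return "\n".join(parts)
```

In the loop above, the raw file contents would be passed through extract_text before the whitespace is collapsed, and the result handed to summarize and keywords. Stripping footers such as signatures is not covered here, since that depends on how the footers in these particular files are marked.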

0 answers:

No answers