我有一个格式化的文本文件,由outlook电子邮件组成。
我正在尝试解析“发件人”,“主题”(分成多个字段),然后阅读其余内容,直到下一封新电子邮件由新来自:
指示首先,我试图强制它,因为这是对概念证明的测试,但是,我只是在链中收到最后电子邮件。
l = []
with open(r'transcripts.txt', 'r') as transcripts:
for line in transcripts:
is_new_subject = line.lower().startswith('from')
if is_new_subject:
record = {}
record['from'] = line.split(':')[1]
for line in transcripts:
if line.lower().startswith('subject'):
subject = line.split(':')[1]
record['subject'] = subject
split_it = subject.split('.')
record['show'] = split_it[0]
record['air_date'] = split_it[1]
record['hour'] = split_it[2]
record['content'] = ""
for line in transcripts:
record['content'] += line
is_new_subject = line.lower().startswith('from')
if is_new_subject:
l.append(record)
break
with open('output.json', 'w') as outfile:
json.dump(l, outfile, indent=4)
任何想法,我将从头开始重新加工
答案 0 :(得分:1)
您的代码有点难以阅读,我认为如果将其分解为函数,调试它会更容易。另外,我建议使用python的re库进行这种类型的文本处理,因为它比仅测试静态字符串更灵活。例如:
import re
def parse_emails_from_list(email_list):
"""returns a list of emails from an email list"""
return re.compile("From:").split(email_list)
def parse_email_details_from_email(email):
"""do some more processing here"""
email = {}
email['subject'] = #parse your email details here
#...
#...
return email
if __name__ == "main":
"""main loop"""
parsed_emails = []
with open(r'transcripts.txt', 'r') as email_list:
email_list = parse_emails_from_list(transcripts)
[parsed_emails.append(parse_email_details_from_email(email)) for email in email_list]
with open('output.json', 'w') as outfile:
json.dump(parsed_emails, outfile, indent=4)
在仔细查看代码后,很明显您的循环逻辑肯定是您遇到问题的地方。
答案 1 :(得分:1)
您应该尝试Email Parser。这很容易使用。出于某种原因,此电子邮件不适用于多部分电子邮件。所以我使用了@Max Paymar创建的split函数。谢谢@Max Paymar。
import email
import re
def parse_emails_from_list(email_list):
"""returns a list of emails from an email list"""
return re.compile("From:").split(email_list)
a=open('sampleEmail.txt','r')
email_list = parse_emails_from_list(a.read())
for E_mail in email_list:
msg = email.message_from_string('From:'+E_mail)
print msg['Subject']
print msg['From']
print msg.get_payload()