Problem getting HTML from an email: I only receive a string

Asked: 2019-05-04 04:25:05

Tags: python html email parsing

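I need to strip the HTML out of emails. The code I wrote works fine for other emails, but for emails from one particular sender it returns one large string instead of HTML.

Update: the string I receive is base64-encoded, but my code still only gets the base64 part of the email rather than the HTML, so the problem remains.

Here is what my code looks like: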

import datetime
import email
import email.utils
import imaplib

m = imaplib.IMAP4_SSL('imap.mail.yahoo.com')
m.login('xxxxxx', 'xxxxxxxx')

rv, mailboxes = m.list()
if rv == 'OK':
    print ("Mailboxes:")
    print (mailboxes)


def process_mailbox(m):
  rv, data = m.search(None, "ALL")
  if rv != 'OK':
      print ("No messages found!")
      return

  for num in data[0].split():
      rv, data = m.fetch(num, '(RFC822)')
      if rv != 'OK':
          print ("ERROR getting message", num)
          return

      # On Python 3, imaplib returns the message as bytes, so parse it as bytes.
      msg = email.message_from_bytes(data[0][1])
      print ('Message %s: %s' % (num, msg['Subject']))
      print ('Raw Date:', msg['Date'])
      date_tuple = email.utils.parsedate_tz(msg['Date'])
      if date_tuple:
          local_date = datetime.datetime.fromtimestamp(
              email.utils.mktime_tz(date_tuple))
          print ("Local Date:",
              local_date.strftime("%a, %d %b %Y %H:%M:%S"))



m.select('MAILBOX', readonly=True)

resp, items = m.search(None, "ALL")
items = items[0].split() # getting the mails id



for emailid in items:
  resp, data = m.fetch(emailid, "(RFC822)") 
  raw_email = data[0][1]
  print (raw_email)

Normally at this point I get the raw email, but this time all I get is one large string of characters, and the actual HTML never appears:

Content-Length: 9617
X-Antivirus: Avast (VPS 190503-4, May 3, 2019), Inbound message
X-Antivirus-Status: Clean

PHRhYmxlIHN0eWxlPSJmb250LWZhbWlseTogVGFob21hLCBHZW5ldmEsIHNhbnMtc2Vy
aWY7IiB3aWR0aD0iNjMwIiBjZWxsc3BhY2luZz0iMCIgY2VsbHBhZGRpbmc9IjEwIj4g
PHRib2R5PgogPHRyPgogPHRkPgogPHRhYmxlIHN0eWxlPSJmb250LWZhbWlseTogVGFo
b21hLCBHZW5ldmEsIHNhbnMtc2VyaWY7IiB3aWR0aD0iMTAwJSIgY2VsbHNwYWNpbmc9
IjAiIGNlbGxwYWRkaW5nPSIwIiBib3JkZXI9IjAiPiA8dGJvZHk+CiA8dHI+CiA8dGQg
d2lkdGg9IjEwMCUiPjxjZW50ZXI+PGEgaHJlZj0iaHR0cHM6Ly9zaG9wLm1lcmNvbGEu
Y29tIj48aW1nIHNyYz0iaHR0cHM6Ly9tZWRpYS5tZXJjb2xhLmNvbS9hc3NldHMvaW1h
Z2VzL3Nob3Bsb2dvL01lcmNvbGFfTG9nb3YyLnBuZyIgd2lkdGg9IjMxNCIgaGVpZ2h0
PSIzOSIgYm9yZGVyPSIwIiAvPjwvYT48L2NlbnRlcj48L3RkPgogPC90cj4KIDx0cj4K
IDx0ZD4KPGhyIHN0eWxlPSJjb2xvcjogI2VjZWNlYzsgd2lkdGg6IDEwMCU7IiAvPjwv
dGQ+CiA8L3RyPgogPC90Ym9keT4KIDwvdGFibGU+CiA8L3RkPgogPC90cj4KIDx0cj4K
IDx0ZCBzdHlsZT0icGFkZGluZzogMTBweCAzMHB4IDMwcHggMzBweDsiPjxzcGFuIHN0
eWxlPSJmb250LXNpemU6IDE1cHQ7IGZvbnQtd2VpZ2h0OiBib2xkOyBjb2xvcjogIzEy
NmFhYTsiPlNoaXBwaW5nIENvbmZpcm1hdGlvbjwvc3Bhbj48YnIgLz48YnIgLz48Yj48
c3BhbiBzdHlsZT0iZm9udC1zaXplOiAxMnB0OyI+RGVhciBQYXRyaWNpYSBTY2hsZXVz
bmVyLDwvc3Bhbj48L2I+PGJyIC8+PGJyIC8+PHNwYW4gc3R5bGU9ImZvbnQtc2l6ZTog
MTJwdDsiPlRoYW5rIHlvdSBmb3IgeW91ciByZWNlbnQgb3JkZXIgZnJvbSA8YSBocmVm
PSJodHRwczovL3Nob3AubWVyY29sYS5jb20iPk1lcmNvbGE8L2E+LiBXZSBhcmUgcGxl
YXNlZCB0byBpbmZvcm0geW91IHRoYXQgeW91IGFyZSBub3cgb25lIHN0ZXAgY2xvc2Vy
IHRvIHRha2luZyBjb250cm9sIG9mIHlvdXIgaGVhbHRoISBZb3VyIG9yZGVyIG51bWJl
ciBPMTUwOTMxMDkgaGFzIGJlZW4gc2hpcHBlZCBhbmQgaXMgb24gaXRzIHdheSB0byB5
b3UuPGJyIC8+PGJyIC8+VGhlIHNoaXBtZW50IGRldGFpbHMgYXJlIGFzIGJlbG93Ojwv
c3Bhbj48YnIgLz48YnIgLz4gPHRhYmxlIHN0eWxlPSJmb250LXNpemU6IDEycHQ7IGZv
bnQtZmFtaWx5OiBUYWhvbWEsIEdlbmV2YSwgc2Fucy1zZXJpZjsgdGV4dC1hbGlnbjog
bGVmdDsiIHdpZHRoPSIxMDAlIiBjZWxsc3BhY2luZz0iMCIgY2VsbHBhZGRpbmc9Ijci

1 Answer:

Answer 0 (score: 1)

Since you are able to create a Message object from the raw data, you can use its features to extract the information you need.

import email
from email import policy

# Set the policy to create an EmailMessage instance.
# On Python 3, data[0][1] from imaplib is bytes, so use message_from_bytes.
msg = email.message_from_bytes(data[0][1], policy=policy.default)
# Get the part most likely to be the preferred body.
body = msg.get_body()
# get_content() will automatically decode from base64 or quoted-printable. 
print(body.get_content())

Setting the policy to policy.default when creating the message object ensures that an EmailMessage instance is returned; this class provides the get_body and get_content methods.
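To illustrate the difference the policy makes, here is a minimal sketch. The message in `raw` is a made-up stand-in for the `data[0][1]` bytes returned by IMAP, and the names `legacy` and `modern` are only for illustration. It assumes Python 3.6 or later.

```python
import base64
import email
from email import policy
from email.message import EmailMessage

# Hypothetical one-part HTML message whose body is base64-encoded,
# standing in for the raw bytes fetched over IMAP.
html_source = "<b>hi</b>"
raw = (
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    + base64.b64encode(html_source.encode("utf-8"))
    + b"\r\n"
)

# Without a policy argument the parser uses compat32 and builds a legacy Message.
legacy = email.message_from_bytes(raw)
# With policy.default the parser builds an EmailMessage.
modern = email.message_from_bytes(raw, policy=policy.default)

print(type(legacy).__name__, hasattr(legacy, "get_body"))
print(type(modern).__name__, hasattr(modern, "get_body"))

# The legacy payload is still the base64 text; get_content() decodes it.
print(legacy.get_payload().strip())
print(modern.get_content().strip())
```

With the legacy object, the closest equivalent is `get_payload(decode=True)`, which returns the decoded body as bytes rather than text.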

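EmailMessage.get_body()

Returns the MIME part that is the best candidate to be the "body" of the message.

You can pass a list of subtypes (the preferencelist argument) to guide its behavior.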
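As a sketch of how the preference list could be used to ask for the HTML part only: the sample message below is built locally purely for illustration, and `extract_html` is a hypothetical helper name, not part of the email package. In the question's code, `raw_email` would instead be `data[0][1]`.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a hypothetical multipart/alternative message with a base64-encoded
# HTML part, roughly mimicking the email described in the question.
sample = EmailMessage()
sample["Subject"] = "Shipping Confirmation"
sample.set_content("Your order has shipped.")  # text/plain part
sample.add_alternative(
    "<p>Your order has <b>shipped</b>.</p>", subtype="html", cte="base64"
)
raw_email = sample.as_bytes()


def extract_html(raw_bytes):
    """Return the decoded HTML body as a string, or None if there is no HTML part."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # Restrict the preference list so that only a text/html part qualifies.
    part = msg.get_body(preferencelist=("html",))
    if part is None:
        return None
    # get_content() undoes the base64 or quoted-printable transfer encoding.
    return part.get_content()


html = extract_html(raw_email)
print(html)
```

Because `get_body` returns None when no part matches the preference list, checking for None before calling `get_content()` avoids an AttributeError on plain-text-only messages.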