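Scraping newbie here. I'm trying to write a scraper with BeautifulSoup that pulls HTML tables out of emails in a Gmail account. The script uses IMAP to check the inbox at intervals. I'm not sure how to extract the HTML from an email, which I need in order to scrape the tables. At the moment it extracts the plain-text body rather than the raw HTML: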
m.select("[Gmail]/All Mail")
resp, items = m.search(None, "ALL")
items = items[0].split()
for emailid in items:
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1]  # getting the mail content
    mail = email.message_from_string(email_body)
    soup = BeautifulSoup(mail)
    tables = soup.find_all("table", width=900)
...
Answer 0 (score: 1)
Thanks, everyone! Once I realized that the HTML was still being fetched, right after the plain-text body, I found a very simple solution.
for emailid in items:
    # "(RFC822)" means "get the whole message"; you can also ask for headers only, etc.
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1]  # getting the mail content
    # On Python 3 email_body is bytes, so search for b'<div' instead.
    start = email_body.find('<div')
    html = email_body[start:]  # everything from the first <div onward
    soup = BeautifulSoup(html)
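
The string search above only works if the HTML part begins with `<div` and the message body is not base64 or quoted-printable encoded. A possibly more robust alternative is to let the standard-library `email` module locate and decode the `text/html` part. The following is a minimal sketch, not the method from the original post; the helper name `extract_html` and the demo message are made up for illustration.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def extract_html(raw_message):
    """Return the decoded text/html part of a raw RFC 822 message, or None.

    Hypothetical helper for illustration; not part of imaplib or bs4.
    """
    # imaplib hands back bytes on Python 3; accept str as well.
    if isinstance(raw_message, bytes):
        msg = email.message_from_bytes(raw_message)
    else:
        msg = email.message_from_string(raw_message)

    # walk() visits the message itself and every nested MIME part.
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes base64 / quoted-printable transfer encoding.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None


# Demo with a locally built multipart message (stands in for data[0][1]).
demo = MIMEMultipart("alternative")
demo.attach(MIMEText("Price: 10", "plain"))
demo.attach(
    MIMEText('<div><table width="900"><tr><td>10</td></tr></table></div>', "html")
)
html = extract_html(demo.as_bytes())
print(html)
```

Inside the fetch loop, `extract_html(data[0][1])` could take the place of the slicing, and a non-`None` result could then be passed to `BeautifulSoup(html, "html.parser")` before calling `find_all("table", width=900)`.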