Scraping email HTML via IMAP

Time: 2014-01-07 01:39:14

Tags: python html web-scraping beautifulsoup imap

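Scraping newbie here. I'm trying to write a scraper with BeautifulSoup that pulls HTML tables out of the emails in a Gmail account. The script uses IMAP to check the inbox at intervals. I'm not sure how to extract the HTML from an email, which I need in order to scrape the tables. Right now it pulls out the body text rather than the raw HTML: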

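# m is assumed to be an authenticated imaplib connection (e.g. imaplib.IMAP4_SSL)
m.select("[Gmail]/All Mail")

resp, items = m.search(None, "ALL")  # message numbers of everything in the mailbox
items = items[0].split()
for emailid in items:
    resp, data = m.fetch(emailid, "(RFC822)")  # fetch the full raw message
    email_body = data[0][1]  # getting the mail content
    mail = email.message_from_string(email_body)  # parse into a Message object
    soup = BeautifulSoup(mail)
    tables = soup.find_all("table", width=900)
    ...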

1 answer:

Answer 0 (score: 1)

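Thanks, guys! I found a very simple solution once I realized the HTML was still being fetched: it sits right after the plain-text body in the raw message.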

for emailid in items:
    # "(RFC822)" means "get the whole message", but you can ask for headers only, etc.
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1]  # getting the mail content
    start = email_body.find('<div')  # the HTML begins at the first <div
    html = email_body[start:]  # renamed from `email` so the email module is not shadowed
    soup = BeautifulSoup(html)
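
Slicing at the first '<div' works on the raw source, but it may break if the HTML part is base64 or quoted-printable encoded, or if the markup does not start with a div. As an alternative, here is a minimal sketch that uses the standard email module to pick out the text/html MIME part. It assumes the message actually contains such a part; extract_html and the sample message are hypothetical names made up for illustration, and the sample only stands in for data[0][1] from m.fetch().

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def extract_html(raw_message):
    """Return the first text/html part of a raw RFC 822 message, or None."""
    # imaplib returns bytes on Python 3; accept str as well for convenience.
    if isinstance(raw_message, bytes):
        msg = email.message_from_bytes(raw_message)
    else:
        msg = email.message_from_string(raw_message)

    # walk() visits the message itself and every nested MIME part.
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes base64 / quoted-printable transfer encoding.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None


# Hypothetical sample: a multipart/alternative message with a plain-text
# part followed by an HTML part, the layout the answer above relies on.
sample = MIMEMultipart("alternative")
sample.attach(MIMEText("plain text version", "plain"))
sample.attach(MIMEText('<div><table width="900"><tr><td>cell</td></tr></table></div>', "html"))

html = extract_html(sample.as_bytes())
```

In the fetch loop, the returned string could then be handed to BeautifulSoup in place of the sliced body, for example BeautifulSoup(extract_html(data[0][1]), "html.parser"), after checking that the result is not None.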