My script fetches HTML from an email inbox via imaplib, passes it to BeautifulSoup, and tries to extract all the hrefs in it.
rv, data = M.search(None, '(FROM "foo@bar.com")')
if rv == 'OK':
    for num in data[0].split():
        # Fetch the full raw message (headers and body)
        typ, msg_data = M.fetch(num, '(RFC822)')
        html = msg_data[0][1]
        soup = BeautifulSoup(html, 'lxml')
        for a in soup.find_all('a', href=True):
            print(a['href'])
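However, the html variable contains HTML with a line break every N characters, which prevents BeautifulSoup from returning the hrefs accurately, especially long ones that get split across lines.
There are also strange characters, such as =0D and 3D.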
messages, <a=0D
href=3D"http://links.google.com/wf/click?upn=3DOGGGYNMPA980E3DmngbHusD=
Uo-2BK17XLM3ogFJfQXXXfMWZLdsQSSVv33HbPoHPXGcH8tSf9ZFFU5i-2FrV4O6ISlpDCIVaN5=
83xr1CGoa5yxZimagE5JiSUAhbZH8P7WiNvf35BsXrCxmrmRLMGB-2BJAQ-3D-3D_IcMuwcQVVt=
a699aeVjRRVxwBCNHkXaWO-2FyIlAqZ7CPsryDB24UVYZbMIvGLJb13chayC-2FLeucv-2FTrko=
7LaiaWHkzy85DWXrK1olI1SEJZs-2BMCAWfoVfloGJivlLSH0GQk0XeVT0j383tZrsymuWLF0S2=
q5j3LR91e76dRXQe7p8t5CgrBe-2FqGk6bmURG9XCNw3dwpHnymaR-2FggHQx6GnbbueF7PVp2H=
-2BGoHUEkMOSXJ8FfSgQIiGICvxz1zcBJPw-2FRoE3YDl-2By8XETkXjVaNchNA1ZN8FDCD5VUf=
V9oUOnavAirXX-2FEw1THfSpV4VYDX">unsubscribe</a></td>=0D
</tr>=0D
<tr>=0D
<td height=3D"12"></td>=0D
</tr>=0D
What can we do to fix this?
Answer 0 (score: 1)
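You can use quopri to decode the quoted-printable data:
Quoted-Printable, or QP encoding, is an encoding that uses printable ASCII characters (alphanumerics and the equals sign "=") to transmit 8-bit data over a 7-bit data path or, more generally, over a medium that is not 8-bit clean. It is defined as a MIME content transfer encoding for use in e-mail.
QP works by using the equals sign "=" as an escape character. It also limits line length to 76, because some software has limits on line length. Longer lines are split with a trailing "=" (a soft line break), which is why each wrapped line in the sample above ends with "=".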
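To make the escaping rules concrete, here is a minimal sketch using only the standard library. The `encoded` sample is a made-up fragment, not taken from the question, chosen to show all three cases: an escaped "=", an escaped carriage return, and a soft line break.

```python
import quopri

# Hypothetical sample, not from the original message:
#   "=3D"  is the escaped form of "=" (0x3D)
#   "=0D"  is an escaped carriage return (0x0D)
#   a trailing "=" followed by a newline is a soft line break,
#   which the decoder removes so the two halves are rejoined
encoded = b'<td height=3D"12">long=\nline</td>=0D\n'

decoded = quopri.decodestring(encoded)
print(decoded)
```

The decoded bytes contain `height="12"` and the rejoined word `longline`, which is the same repair BeautifulSoup needs before it can read a long href.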
For example, using the sample from the question:
html = """<a=0D
href=3D"http://links.google.com/wf/click?upn=3DOGGGYNMPA980E3DmngbHusD=
Uo-2BK17XLM3ogFJfQXXXfMWZLdsQSSVv33HbPoHPXGcH8tSf9ZFFU5i-2FrV4O6ISlpDCIVaN5=
83xr1CGoa5yxZimagE5JiSUAhbZH8P7WiNvf35BsXrCxmrmRLMGB-2BJAQ-3D-3D_IcMuwcQVVt=
a699aeVjRRVxwBCNHkXaWO-2FyIlAqZ7CPsryDB24UVYZbMIvGLJb13chayC-2FLeucv-2FTrko=
7LaiaWHkzy85DWXrK1olI1SEJZs-2BMCAWfoVfloGJivlLSH0GQk0XeVT0j383tZrsymuWLF0S2=
q5j3LR91e76dRXQe7p8t5CgrBe-2FqGk6bmURG9XCNw3dwpHnymaR-2FggHQx6GnbbueF7PVp2H=
-2BGoHUEkMOSXJ8FfSgQIiGICvxz1zcBJPw-2FRoE3YDl-2By8XETkXjVaNchNA1ZN8FDCD5VUf=
V9oUOnavAirXX-2FEw1THfSpV4VYDX">unsubscribe</a></td>=0D
</tr>=0D
<tr>=0D
<td height=3D"12"></td>=0D
</tr>=0D"""
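from bs4 import BeautifulSoup
import quopri
# Decode the quoted-printable data to bytes before parsing
soup = BeautifulSoup(quopri.decodestring(html), "lxml")
print(soup)
# The href is now a single unbroken string
print(soup.select_one("a")["href"])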
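This prints the decoded markup followed by the href. The hex values explain the odd characters: 0x3D is "=" and 0x0D is a carriage return, so "=3D" and "=0D" are simply their escaped forms: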
<html><body><a href="http://links.google.com/wf/click?upn=OGGGYNMPA980E3DmngbHusDUo-2BK17XLM3ogFJfQXXXfMWZLdsQSSVv33HbPoHPXGcH8tSf9ZFFU5i-2FrV4O6ISlpDCIVaN583xr1CGoa5yxZimagE5JiSUAhbZH8P7WiNvf35BsXrCxmrmRLMGB-2BJAQ-3D-3D_IcMuwcQVVta699aeVjRRVxwBCNHkXaWO-2FyIlAqZ7CPsryDB24UVYZbMIvGLJb13chayC-2FLeucv-2FTrko7LaiaWHkzy85DWXrK1olI1SEJZs-2BMCAWfoVfloGJivlLSH0GQk0XeVT0j383tZrsymuWLF0S2q5j3LR91e76dRXQe7p8t5CgrBe-2FqGk6bmURG9XCNw3dwpHnymaR-2FggHQx6GnbbueF7PVp2H-2BGoHUEkMOSXJ8FfSgQIiGICvxz1zcBJPw-2FRoE3YDl-2By8XETkXjVaNchNA1ZN8FDCD5VUfV9oUOnavAirXX-2FEw1THfSpV4VYDX">unsubscribe</a>
<tr>
<td height="12"></td>
</tr> </body></html>
http://links.google.com/wf/click?upn=OGGGYNMPA980E3DmngbHusDUo-2BK17XLM3ogFJfQXXXfMWZLdsQSSVv33HbPoHPXGcH8tSf9ZFFU5i-2FrV4O6ISlpDCIVaN583xr1CGoa5yxZimagE5JiSUAhbZH8P7WiNvf35BsXrCxmrmRLMGB-2BJAQ-3D-3D_IcMuwcQVVta699aeVjRRVxwBCNHkXaWO-2FyIlAqZ7CPsryDB24UVYZbMIvGLJb13chayC-2FLeucv-2FTrko7LaiaWHkzy85DWXrK1olI1SEJZs-2BMCAWfoVfloGJivlLSH0GQk0XeVT0j383tZrsymuWLF0S2q5j3LR91e76dRXQe7p8t5CgrBe-2FqGk6bmURG9XCNw3dwpHnymaR-2FggHQx6GnbbueF7PVp2H-2BGoHUEkMOSXJ8FfSgQIiGICvxz1zcBJPw-2FRoE3YDl-2By8XETkXjVaNchNA1ZN8FDCD5VUfV9oUOnavAirXX-2FEw1THfSpV4VYDX
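As an alternative sketch, since the question fetches the whole message with `(RFC822)`, the standard `email` package can parse it and undo the transfer encoding per part, which avoids running quopri over the headers as well. The helper name `html_parts` and the `raw` sample message are assumptions made up for illustration; in the question's script, `raw` would be the bytes in `data[0][1]`.

```python
import email

def html_parts(raw_bytes):
    """Yield the decoded text/html bodies found in a raw RFC 822 message."""
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True reverses the Content-Transfer-Encoding
            # (quoted-printable or base64) and returns bytes.
            payload = part.get_payload(decode=True)
            # Fall back to UTF-8 if the part declares no charset (assumption).
            charset = part.get_content_charset() or "utf-8"
            yield payload.decode(charset, errors="replace")

# Hypothetical message standing in for the bytes returned by M.fetch.
raw = (
    b"From: foo@bar.com\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/html; charset=utf-8\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b'<a=0D\nhref=3D"http://example.com/?x=3D1&y=\n=3D2">unsubscribe</a>\n'
)

pages = list(html_parts(raw))
for html in pages:
    print(html)
```

Each yielded string can then be passed to `BeautifulSoup(html, 'lxml')` exactly as in the question. This also covers messages whose HTML part is base64-encoded, which quopri alone would not handle.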