I am trying to scrape the email address from the following URL.
import requests
from bs4 import BeautifulSoup

myurl = "https://www.charitychoice.co.uk/alzheimers-research-uk"
agent = {'User-Agent': 'Magic Browser'}
req1 = requests.get(myurl, headers=agent, verify=False)
soup2 = BeautifulSoup(req1.content, "lxml")
for email in soup2.findAll('div', {"class": "charity-contact-details"}):
    for email1 in email.findAll('p'):
        for email2 in email1.findAll('span', {"itemprop": "email"}):
            for email3 in email2.findAll('a'):
                email4 = email3.text
                print(email4)
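It does not print the email I expect.
It does work with Selenium and PhantomJS, but it takes a long time to display the email address.
Please help me find the right parser so that the email address is returned immediately.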
Answer 0 (score: 1)
Your code is fine. The problem is that the content of <span itemprop="email"> is not an <a> tag but a script, as shown here:
<span itemprop="email">
<script language="javascript" type="text/javascript">
<!--
{document.write(String.fromCharCode(60,97,32,104,114,101,102,61,34,109,97,105,108,116,111,58,101,110,113,117,105,114,105,101,115,64,97,108,122,104,101,105,109,101,114,115,114,101,115,101,97,114,99,104,117,107,46,111,114,103,34,32,62,101,110,113,117,105,114,105,101,115,64,97,108,122,104,101,105,109,101,114,115,114,101,115,101,97,114,99,104,117,107,46,111,114,103,60,47,97,62))}
//-->
</script>
</span>
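In other words, the email is obfuscated to guard against spam: the <a> tag only exists once a browser runs the JavaScript, which is why requests plus Beautiful Soup never sees it. Still, nothing stops us from decoding the character codes ourselves: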
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.charitychoice.co.uk/alzheimers-research-uk"
agent = {"User-Agent": "Magic Browser"}
req = requests.get(url, headers=agent, verify=False)
soup = BeautifulSoup(req.content, "lxml")
for span in soup.findAll("span", {"itemprop": "email"}):
    # Split the script text on non-digits and turn each character code back into a character.
    email = "".join(chr(int(n)) for n in re.split(r"[^\d]", span.text) if n)
    # The decoded string is itself an <a> tag, so parse it again to read the link text.
    for x in BeautifulSoup(email, "lxml").findAll("a"):
        print(x.text)
Output:
enquiries@alzheimersresearchuk.org
I hope this answer is not used for spam! Also, the program still has to make a request and Beautiful Soup still has to parse the HTML, so it is not "instant".
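The decoding step can also be tried on its own, without any network access. The following is a minimal sketch, not part of the original answer: the helper names decode_charcodes and extract_email and the address someone@example.org are made up for illustration. It assumes the script contains a single String.fromCharCode(...) call, and it reads only the numbers inside that call, so other digits elsewhere in the script would not be mixed in.

```python
import re


def decode_charcodes(script_text):
    """Return the string that String.fromCharCode(...) would produce, or None.

    Hypothetical helper: it only looks at the first String.fromCharCode call
    and assumes its arguments are plain decimal character codes.
    """
    match = re.search(r"String\.fromCharCode\(([\d,\s]+)\)", script_text)
    if match is None:
        return None
    codes = re.findall(r"\d+", match.group(1))
    return "".join(chr(int(code)) for code in codes)


def extract_email(script_text):
    """Decode the script and pull the address out of the mailto: link, or None."""
    html = decode_charcodes(script_text)
    if html is None:
        return None
    match = re.search(r'mailto:([^"\s>]+)', html)
    return match.group(1) if match else None


# Build an obfuscated sample the same way the page does (made-up address).
link = '<a href="mailto:someone@example.org">someone@example.org</a>'
sample = "document.write(String.fromCharCode(%s))" % ",".join(
    str(ord(ch)) for ch in link
)
print(extract_email(sample))  # someone@example.org
```

In the scraper above, span.text could be passed to extract_email in place of the second BeautifulSoup parse, assuming the page keeps the same obfuscation scheme.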