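I am using Beautiful Soup and want my web scraper to collect emails down to a depth that I choose. At the moment I am not sure why the scraper does not work: every time I run it, the email list is never populated.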

#!/usr/bin/python
from bs4 import BeautifulSoup, SoupStrainer
import re
import urllib
import threading


def step2():
    file = open('output.html', 'w+')
    file.close()
    # links already added
    visited = set()
    visited_emails = set()
    scrape_page(visited, visited_emails, 'https://www.google.com', 2)
    print('Webpages \n')
    for w in visited:
        print(w)
    print('Emails \n')
    for e in visited_emails:
        print(e)


# Run recursively
def scrape_page(visited, visited_emails, url, depth):
    if depth == 0:
        return
    website = urllib.urlopen(url)
    soup = BeautifulSoup(website, parseOnlyThese=SoupStrainer('a', email=False))
    emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}", str(website))
    first = str(website).split('mailto:')
    for i in range(1, len(first)):
        print(first.split('>')[0])
    for email in emails:
        if email not in visited_emails:
            print('- got email ' + email)
            visited_emails.add(email)
    for link in soup:
        if link.has_attr('href'):
            if link['href'] not in visited:
                if link['href'].startswith('https://www.google.com'):
                    visited.add(link['href'])
                    scrape_page(visited, visited_emails, link['href'], depth - 1)


def main():
    step2()


main()

I am not sure how to fix my code so that the emails get added to the list. I would appreciate any advice you can give me. Thanks.
Answer 0 (score: 1)
You just need to look for the hrefs that start with mailto:

emails = [a["href"] for a in soup.select('a[href^="mailto:"]')]
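
As a minimal sketch of this selector approach, the snippet below runs it against a small hypothetical HTML string (the sample markup and addresses are made up for illustration). The attribute value is quoted here, since newer Beautiful Soup releases pass selectors to a stricter CSS engine that may reject an unquoted colon.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
html = """
<p>Contact: <a href="mailto:alice@example.com">Alice</a></p>
<p><a href="/about">About</a></p>
<p><a href="mailto:bob@example.org?subject=Hi">Bob</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# Select only anchors whose href begins with "mailto:".
mailto_links = [a["href"] for a in soup.select('a[href^="mailto:"]')]

# Strip the scheme and any "?subject=..." query to keep just the address.
emails = [href[len("mailto:"):].split("?", 1)[0] for href in mailto_links]
print(emails)
```

The `/about` link is skipped because its href does not start with `mailto:`, so only the two addresses end up in `emails`.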
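I assume https://www.google.com is a placeholder for the actual site you are scraping, because Google's pages do not contain any mailto links to scrape. If there are mailto links in the page source you are searching, this will find them.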
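
Putting this together with the depth-limited recursion from the question, a rough Python 3 sketch might look like the following. It is only one possible arrangement, not a definitive fix: the `fetch` helper, the `prefix` parameter, and the two-page `pages` dictionary are hypothetical names introduced here. One likely reason the list stays empty in the question's code is that the regex runs over `str(website)`, which is the string form of the response object rather than the page's HTML, so this sketch reads and decodes the body once and reuses that text.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def fetch(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_page(visited, emails, url, depth, prefix, fetch=fetch):
    """Collect emails from url and from same-site links, `depth` levels deep."""
    if depth == 0 or url in visited:
        return
    visited.add(url)
    html = fetch(url)  # the page text itself, not str(response)
    emails.update(EMAIL_RE.findall(html))
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith("mailto:"):
            # Keep only the address, dropping any "?subject=..." part.
            emails.add(href[len("mailto:"):].split("?", 1)[0])
            continue
        link = urljoin(url, href)  # resolve relative links against the page
        if link.startswith(prefix):
            scrape_page(visited, emails, link, depth - 1, prefix, fetch)


# Offline usage with a hypothetical two-page site instead of real requests.
pages = {
    "https://example.com/": '<a href="/team">Team</a> info@example.com',
    "https://example.com/team": '<a href="mailto:carol@example.com">Carol</a>',
}
visited, emails = set(), set()
scrape_page(visited, emails, "https://example.com/", 2, "https://example.com", pages.get)
print(sorted(emails))
```

Passing `pages.get` as the `fetch` argument keeps the example runnable without network access; leaving it out falls back to the real download helper. Because both collections are sets, the explicit "already seen" checks from the question are not needed for emails.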