How do I scrape data from multiple domains with a single scraper? I managed to crawl a single website with Beautiful Soup, but I can't figure out how to make the crawler generic.
Answer 0: (score: 0)
The question is flawed as stated: for a single crawler to work, the sites you want to scrape must have something in common.
from urllib import request

from bs4 import BeautifulSoup

for counter in range(10):
    # Ask for the site to crawl and store it in the > site < variable
    site = input("Type the name of your website: ")  # raw_input() on Python 2
    # Make a request to the site and read the response body
    make_request_to_site = request.urlopen(site).read()
    # Pass the response through a BeautifulSoup parser, html.parser here
    soup = BeautifulSoup(make_request_to_site, "html.parser")
    # Loop over all the links in the page we fetched
    for link in soup.find_all('a'):
        print(link['href'])
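One way to give the sites "something in common" is to supply the per-site knowledge yourself, as a mapping from domain to a CSS selector. A minimal sketch of that idea (the domains, selectors, and sample HTML below are made-up examples, not from the question):

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical per-domain configuration: each site gets its own
# CSS selector for the links we actually care about.
SITE_SELECTORS = {
    "example.com": "a.article-link",
    "example.org": "div.post a",
}

def extract_links(url, html):
    """Pick the selector for this URL's domain and return matching hrefs."""
    domain = urlparse(url).netloc
    # Fall back to all anchors for domains we have no rule for
    selector = SITE_SELECTORS.get(domain, "a")
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select(selector) if a.has_attr("href")]

html = '<div class="post"><a href="/1">one</a></div><a href="/2">two</a>'
print(extract_links("http://example.org/index.html", html))  # ['/1']
print(extract_links("http://unknown.net/", html))            # ['/1', '/2']
```

The fetch loop from the answer stays the same; only the parsing step consults the table, so adding a new site means adding one dictionary entry rather than a new crawler.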
Answer 1: (score: 0)
As mentioned above, every site has its own unique set of selectors (classes, ids, and so on). A single generic crawler cannot visit a URL and intuitively figure out what to scrape.
BeautifulSoup may not be the best choice for this kind of task. Scrapy is another web crawling library, and it is considerably more powerful than BS4.
Similar question on Stack Overflow: Scrapy approach to scraping multiple URLs
Scrapy documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html