I am using this Python code to fetch 10000 URLs from Wikipedia:
import scrapy

class WikipediaCrawler(scrapy.Spider):
    name = "wikipedia-crawler"
    start_urls = ['https://en.wikipedia.org/wiki/Special:Random']

    def start_requests(self):
        for page_counter in range(0, 10000):
            yield scrapy.Request(url=self.start_urls[0], callback=self.save_url)
        for page_counter in range(0, 10000):
            yield scrapy.Request(url=self.start_urls[page_counter], callback=self.parse)

    def parse(self, response):
        urls = []
        for link in response.css('a::attr(href)'):
            urls.append(link.extract())
        file_name = response.url.split("/")[-1] + '.html'
        file_name = file_name.replace(':', '_')
        with open('crawled/' + file_name, 'wb') as f:
            f.write(response.body)
        yield {
            str(response.url): {
                'ranking': 5,
                'links': urls
            }
        }

    def save_url(self, response):
        self.start_urls.append(response.url)
It doesn't work; it only processes one page.
I'm using the https://en.wikipedia.org/wiki/Special:Random URL to fetch random Wikipedia pages and record their URLs.
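For context: Special:Random does not serve an article itself; it answers with an HTTP redirect to a random article, which Scrapy follows, so response.url ends up being the resolved article URL. A quick standalone check with the requests library (purely illustrative, separate from the spider):

import requests

# Ask for Special:Random without following the redirect, so the
# redirect response and its Location header are visible directly.
r = requests.get('https://en.wikipedia.org/wiki/Special:Random',
                 allow_redirects=False)
print(r.status_code)          # typically 302
print(r.headers['Location'])  # URL of a random article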
Answer 0 (score: 1)
You are yielding the same start_url (i.e. url=self.start_urls[0]) 10000 times. I think you should remove the start_urls variable and change start_requests to:
def start_requests(self):
    for page_counter in range(0, 10000):
        # no callback given, so Scrapy falls back to self.parse
        yield scrapy.Request(url='https://en.wikipedia.org/wiki/Special:Random')
My understanding of the docs is that Scrapy uses start_requests as a generator rather than reading start_urls as a list. So internally I guess it does something like:

starts = start_requests()
parse(next(starts))
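To make that generator behaviour concrete, here is a minimal plain-Python sketch (no Scrapy involved, names made up for illustration) of how yielding from start_requests hands out requests one at a time:

def start_requests():
    # stand-in for the spider method: produces one "request" per iteration
    for page_counter in range(3):
        yield 'https://en.wikipedia.org/wiki/Special:Random'

starts = start_requests()  # nothing runs yet; we only get a generator object
print(next(starts))        # the first "request" is produced on demand
print(next(starts))        # the engine keeps pulling until the generator is exhausted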
Answer 1 (score: 1)
The dont_filter=True parameter solves the problem:
https://doc.scrapy.org/en/latest/topics/request-response.html#request-objects
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
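For background on why the scheduler collapses these requests in the first place: Scrapy computes a fingerprint per request, and two requests to the same URL get identical fingerprints. A small sketch using request_fingerprint from scrapy.utils.request (deprecated in recent Scrapy releases in favour of a fingerprinter class, but illustrative here):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Two distinct Request objects for the same URL...
r1 = Request('https://en.wikipedia.org/wiki/Special:Random')
r2 = Request('https://en.wikipedia.org/wiki/Special:Random')

# ...share one fingerprint, so the duplicates filter drops the second
# request unless dont_filter=True is set.
print(request_fingerprint(r1) == request_fingerprint(r2))  # True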
Updated code:
import scrapy

class WikipediaCrawler(scrapy.Spider):
    name = "wikipedia-crawler"
    start_urls = ['https://en.wikipedia.org/wiki/Special:Random']

    def start_requests(self):
        for page_counter in range(0, 10000):
            # dont_filter=True lets the identical URL be scheduled repeatedly
            yield scrapy.Request(url=self.start_urls[0], callback=self.parse, dont_filter=True)

    def parse(self, response):
        urls = []
        for link in response.css('a::attr(href)'):
            urls.append(link.extract())
        # response.url is the redirect target, i.e. the random article's URL
        file_name = response.url.split("/")[-1] + '.html'
        file_name = file_name.replace(':', '_')
        # the 'crawled/' directory must exist before the spider runs
        with open('crawled/' + file_name, 'wb') as f:
            f.write(response.body)
        yield {
            str(response.url): {
                'ranking': 5,
                'links': urls
            }
        }
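One way to run the spider and collect the yielded items is a small driver script with CrawlerProcess; a sketch under assumptions (the FEEDS setting requires Scrapy 2.1+, and the file names here are made up):

import os
from scrapy.crawler import CrawlerProcess

os.makedirs('crawled', exist_ok=True)  # parse() writes page bodies here

process = CrawlerProcess(settings={
    'FEEDS': {'links.json': {'format': 'json'}},  # collect yielded items
})
process.crawl(WikipediaCrawler)
process.start()  # blocks until the crawl finishes

Alternatively, scrapy runspider wikipedia_crawler.py -o links.json from the shell does the same without a driver script.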