I set up a spider_idle signal handler to feed the spider another batch of URLs. This seems to work fine at first, but then the Crawled (200)... messages appear less and less often, until they eventually stop showing up altogether. I have 115 test URLs to hand out, yet Scrapy reports Crawled 38 pages.... The spider code and the crawl log are below.
Essentially, I am implementing a two-stage crawl: the first stage only collects the URLs into a urls.jl file, and the second stage scrapes those URLs. I am now working on the code for the second spider.
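For context, the second spider below expects urls.jl to be a JSON-lines file with one object per line containing a "URL" key. A minimal sketch of how the first stage might append to that file (the save_urls helper is a hypothetical illustration, not the actual first-stage code):

import json

# Hypothetical helper: write one JSON object per discovered URL,
# matching the {"URL": ...} shape the second spider reads back.
def save_urls(urls, path='urls.jl'):
    with open(path, 'a') as f:
        for u in urls:
            f.write(json.dumps({"URL": u}) + "\n")

The second spider itself: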
import json
import scrapy
import logging
from scrapy import signals
from scrapy.http.request import Request
from scrapy.exceptions import DontCloseSpider


class A2ndexample_comSpider(scrapy.Spider):
    name = '2nd_example_com'
    allowed_domains = ['www.example.com']

    def parse(self, response):
        pass

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle_consume, signals.spider_idle)
        return spider

    def __init__(self, crawler):
        self.crawler = crawler
        # read from file
        self.urls = []
        with open('urls.jl', 'r') as f:
            for line in f:
                self.urls.append(json.loads(line))
        # How many urls to return from start_requests()
        self.batch_size = 5

    def start_requests(self):
        for i in range(self.batch_size):
            if 0 == len(self.urls):
                return
            url = self.urls.pop(0)
            yield Request(url["URL"])

    def idle_consume(self):
        # Every time the spider is about to close, check our urls
        # buffer if we have something left to crawl
        reqs = self.start_requests()
        if not reqs:
            return
        logging.info('Consuming batch... [left: %d])' % len(self.urls))
        for req in reqs:
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider
Log:
INFO: Spider opened
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG: Telnet console listening on 127.0.0.1:6023
DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-140-m-wroclaw-ID3EMF6.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search[order]=filter_float_price%3Adesc> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/wynajem-obiekt-5-mieszkan-dla-firmy-legnica-ID3Khvk.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/komfortowy-apartament-sky-tower-41-pietro-ID3ytn1.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/apartament-z-przepieknym-widokiem-z-45-pietra-ID3PWvI.html> (referer: None)
INFO: Consuming batch... [left: 110])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-139-04-m-wroclaw-ID3A6dp.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/centrum-willowy-lokal-dostepny-dla-firmy-ID3TgV4.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/wynajem-pietro-na-16-osob-legnica-ID3KcPe.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search%5Border%5D=filter_float_price%3Adesc&page=2> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/apartament-trzypokojowy-na-44-pietrze-sky-tower-ID3qXA8.html> (referer: None)
INFO: Consuming batch... [left: 105])
DEBUG: Filtered duplicate request: <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search[order]=filter_float_price%3Adesc> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-3-pokoje-ul-zatorska-wysoki-standard-ID3GBfa.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/nowe-mieszkanie-2-pokoje-wroclaw-ul-gornicza-ID2NeJT.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/sprzedam-mieszkanie-bezczynszowe-gromadka-ID3S1sA.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-ID3ALrp.html> (referer: None)
INFO: Consuming batch... [left: 100])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/2-pok-balkonosobna-kuchniawindado-urzadzenia-ID3Scza.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-47-m-wroclaw-ID3RTOY.html> (referer: None)
INFO: Consuming batch... [left: 95])
INFO: Consuming batch... [left: 90])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/luksusowy-apartament-101m2-centrum-obok-renomy-ID3O1yI.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-70-m-wroclaw-ID3SS4A.html> (referer: None)
INFO: Consuming batch... [left: 85])
INFO: Consuming batch... [left: 80])
INFO: Consuming batch... [left: 75])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-103-m-wroclaw-ID2ZhbS.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search%5Border%5D=filter_float_price%3Adesc&page=3> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/luksusowe-przestronne-dwa-garaze-ID3LwIs.html> (referer: None)
INFO: Consuming batch... [left: 70])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-118-74-m-wroclaw-ID2W9Fd.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/ekskluzywny-apartament-z-dostepem-do-silowni-i-spa-ID3pGmQ.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-170-m-wroclaw-ID3MBI0.html> (referer: None)
INFO: Consuming batch... [left: 65])
INFO: Crawled 25 pages (at 25 pages/min), scraped 0 items (at 0 items/min)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/loftowe-mieszkanie-krzyki-100-m2-ID3Tfc0.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/nieruchompsc-dla-pracownikow-od-zaraz-ID3TrcA.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/stare-miasto-3-pok-69m2-luxurious-apartment-ID3Qn4o.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/loftowe-100-metrowe-mieszkanie-idealne-na-biuro-ID3Txu4.html> (referer: None)
INFO: Consuming batch... [left: 60])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/lesnica-ul-niepierzynska-123-m2-6-pokoi-ogrod-ID3OoI8.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-63-m-wroclaw-ID3Tbne.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search%5Border%5D=filter_float_price%3Adesc&page=4> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/komfortow-apartament-do-wynajecia-3km-od-wroclawia-ID3SA0M.html> (referer: None)
INFO: Consuming batch... [left: 55])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/zamienie-mieszanie-2-pokoje-40m2-bielawa-na-wieksz-ID3yyFN.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/apartament-sky-tower-z-wanna-przy-oknie-i-sauna-ID2Z7EA.html> (referer: None)
INFO: Consuming batch... [left: 50])
INFO: Consuming batch... [left: 45])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/ul-ksiecia-witolda-3pok-75m2-wysoki-standard-3700-ID3PK2g.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/komfortowe-mieszkanie-do-wynajecia-ID3Lcvk.html> (referer: None)
INFO: Consuming batch... [left: 40])
INFO: Consuming batch... [left: 35])
INFO: Consuming batch... [left: 30])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/hit-klimatyczne-w-sercu-wroclawia-2-pok-ID3SkJ2.html> (referer: None)
INFO: Consuming batch... [left: 25])
INFO: Consuming batch... [left: 20])
INFO: Consuming batch... [left: 15])
INFO: Consuming batch... [left: 10])
INFO: Crawled 38 pages (at 13 pages/min), scraped 0 items (at 0 items/min)
INFO: Consuming batch... [left: 5])
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
(...)
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
INFO: Crawled 38 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
...
I expected the spider to crawl all 115 URLs, not just 38. Moreover, if it no longer wants to crawl anything and the signal handler does not raise DontCloseSpider, shouldn't it at least shut down?
Answer 0 (score: 1)
The missing requests did not fail; otherwise you would also see information about them in the log. They were simply never sent.
If you look closely at the log, you will notice this message:
DEBUG: Filtered duplicate request: <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search[order]=filter_float_price%3Adesc> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
The missing requests were considered duplicates and were therefore skipped. For more information, see the documentation for the DUPEFILTER_CLASS setting.
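If some of the 115 entries in urls.jl genuinely point to the same URL and you still want each request sent (an assumption about your intent, not something Scrapy requires), one possible workaround is to mark the requests with dont_filter=True so the duplicate filter lets them through. A minimal sketch based on the question's start_requests():

    def start_requests(self):
        for i in range(self.batch_size):
            if not self.urls:
                return
            url = self.urls.pop(0)
            # dont_filter=True bypasses Scrapy's duplicate filter for this request
            yield Request(url["URL"], dont_filter=True)

Alternatively, setting DUPEFILTER_DEBUG = True in the project settings logs every filtered request, which makes it easy to confirm how many of the 115 URLs were dropped as duplicates.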