Why does my spider_idle (on-demand) URL feeding gradually shut down?

Time: 2019-01-31 21:21:09

Tags: python scrapy signals

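I set up a spider_idle signal handler to feed the spider another batch of URLs. This seems to work fine at first, but then the Crawled (200)... messages appear less and less often until they stop altogether. I have 115 test URLs to hand out, yet Scrapy reports Crawled 38 pages.... The spider code and the crawl log are below.

In general, I am implementing a two-stage crawl: the first stage only downloads URLs into a urls.jl file, and the second stage scrapes those URLs. I am now working on the code for the second spider.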

import json
import scrapy
import logging
from scrapy import signals
from scrapy.http.request import Request
from scrapy.exceptions import DontCloseSpider


class A2ndexample_comSpider(scrapy.Spider):
    name = '2nd_example_com'
    allowed_domains = ['www.example.com']

    def parse(self, response):
        pass

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle_consume, signals.spider_idle)
        return spider

    def __init__(self, crawler):
        self.crawler = crawler
        # read from file
        self.urls = []

        with open('urls.jl', 'r') as f:
            for line in f:
                self.urls.append(json.loads(line))
        # How many urls to return from start_requests()
        self.batch_size = 5

    def start_requests(self):
        for i in range(self.batch_size):
            if 0 == len(self.urls):
                return
            url = self.urls.pop(0)
            yield Request(url["URL"])

    def idle_consume(self):
        # Everytime spider is about to close check our urls 
        # buffer if we have something left to crawl
        reqs = self.start_requests()
        if not reqs:
            return
        logging.info('Consuming batch... [left: %d])' % len(self.urls))
        for req in reqs:
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider

Log:

INFO: Spider opened
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG: Telnet console listening on 127.0.0.1:6023
DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-140-m-wroclaw-ID3EMF6.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search[order]=filter_float_price%3Adesc> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/wynajem-obiekt-5-mieszkan-dla-firmy-legnica-ID3Khvk.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/komfortowy-apartament-sky-tower-41-pietro-ID3ytn1.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/apartament-z-przepieknym-widokiem-z-45-pietra-ID3PWvI.html> (referer: None)
INFO: Consuming batch... [left: 110])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-139-04-m-wroclaw-ID3A6dp.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/centrum-willowy-lokal-dostepny-dla-firmy-ID3TgV4.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/wynajem-pietro-na-16-osob-legnica-ID3KcPe.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search%5Border%5D=filter_float_price%3Adesc&page=2> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/apartament-trzypokojowy-na-44-pietrze-sky-tower-ID3qXA8.html> (referer: None)
INFO: Consuming batch... [left: 105])
DEBUG: Filtered duplicate request: <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search[order]=filter_float_price%3Adesc> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-3-pokoje-ul-zatorska-wysoki-standard-ID3GBfa.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/nowe-mieszkanie-2-pokoje-wroclaw-ul-gornicza-ID2NeJT.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/sprzedam-mieszkanie-bezczynszowe-gromadka-ID3S1sA.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-ID3ALrp.html> (referer: None)
INFO: Consuming batch... [left: 100])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/2-pok-balkonosobna-kuchniawindado-urzadzenia-ID3Scza.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-47-m-wroclaw-ID3RTOY.html> (referer: None)
INFO: Consuming batch... [left: 95])
INFO: Consuming batch... [left: 90])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/luksusowy-apartament-101m2-centrum-obok-renomy-ID3O1yI.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-70-m-wroclaw-ID3SS4A.html> (referer: None)
INFO: Consuming batch... [left: 85])
INFO: Consuming batch... [left: 80])
INFO: Consuming batch... [left: 75])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-103-m-wroclaw-ID2ZhbS.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search%5Border%5D=filter_float_price%3Adesc&page=3> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/luksusowe-przestronne-dwa-garaze-ID3LwIs.html> (referer: None)
INFO: Consuming batch... [left: 70])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-118-74-m-wroclaw-ID2W9Fd.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/ekskluzywny-apartament-z-dostepem-do-silowni-i-spa-ID3pGmQ.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-170-m-wroclaw-ID3MBI0.html> (referer: None)
INFO: Consuming batch... [left: 65])
INFO: Crawled 25 pages (at 25 pages/min), scraped 0 items (at 0 items/min)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/loftowe-mieszkanie-krzyki-100-m2-ID3Tfc0.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/nieruchompsc-dla-pracownikow-od-zaraz-ID3TrcA.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/stare-miasto-3-pok-69m2-luxurious-apartment-ID3Qn4o.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/loftowe-100-metrowe-mieszkanie-idealne-na-biuro-ID3Txu4.html> (referer: None)
INFO: Consuming batch... [left: 60])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/lesnica-ul-niepierzynska-123-m2-6-pokoi-ogrod-ID3OoI8.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/mieszkanie-63-m-wroclaw-ID3Tbne.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search%5Border%5D=filter_float_price%3Adesc&page=4> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/komfortow-apartament-do-wynajecia-3km-od-wroclawia-ID3SA0M.html> (referer: None)
INFO: Consuming batch... [left: 55])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/zamienie-mieszanie-2-pokoje-40m2-bielawa-na-wieksz-ID3yyFN.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/apartament-sky-tower-z-wanna-przy-oknie-i-sauna-ID2Z7EA.html> (referer: None)
INFO: Consuming batch... [left: 50])
INFO: Consuming batch... [left: 45])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/ul-ksiecia-witolda-3pok-75m2-wysoki-standard-3700-ID3PK2g.html> (referer: None)
DEBUG: Crawled (200) <GET https://www.example.com/oferta/komfortowe-mieszkanie-do-wynajecia-ID3Lcvk.html> (referer: None)
INFO: Consuming batch... [left: 40])
INFO: Consuming batch... [left: 35])
INFO: Consuming batch... [left: 30])
DEBUG: Crawled (200) <GET https://www.example.com/oferta/hit-klimatyczne-w-sercu-wroclawia-2-pok-ID3SkJ2.html> (referer: None)
INFO: Consuming batch... [left: 25])
INFO: Consuming batch... [left: 20])
INFO: Consuming batch... [left: 15])
INFO: Consuming batch... [left: 10])
INFO: Crawled 38 pages (at 13 pages/min), scraped 0 items (at 0 items/min)
INFO: Consuming batch... [left: 5])
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
(...)
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
INFO: Crawled 38 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
INFO: Consuming batch... [left: 0])
...

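I expected the spider to crawl all 115 URLs, not just 38. Also, if it does not want to crawl any more and the signal handler does not raise DontCloseSpider, shouldn't it at least shut down?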
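On the shutdown point, one likely explanation (an inference from the code above, not something confirmed in this thread): start_requests() contains a yield, so calling it returns a generator object, and a generator object is always truthy. The `if not reqs: return` check in idle_consume would therefore never fire, and DontCloseSpider would be raised on every idle signal, which matches the endless "[left: 0]" lines. The sketch below reproduces this with plain Python; `next_batch` and `has_work` are hypothetical names standing in for the spider's methods.

```python
def next_batch(urls, batch_size=5):
    """Yield up to batch_size URLs, popping them off the list.

    Mirrors the shape of start_requests() in the spider above.
    """
    for _ in range(batch_size):
        if not urls:
            return
        yield urls.pop(0)


# Calling a generator function returns a generator object immediately,
# before any of its body has run.
empty = next_batch([])

# A generator object is truthy even when it will yield nothing,
# so "if not reqs" can never be True.
always_truthy = bool(empty)


def has_work(urls):
    """Check the underlying buffer instead of the generator."""
    return len(urls) > 0
```

Under that assumption, testing `self.urls` directly in the idle handler (and returning without raising DontCloseSpider when it is empty) should let the spider close once the buffer is drained.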

1 answer:

Answer 0 (score: 1)

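The missing requests did not fail; otherwise you would also see information about them in the log. They were simply never sent.

If you look closely at the log, you will notice the following message: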

DEBUG: Filtered duplicate request: <GET https://www.example.com/wynajem/mieszkanie/dolnoslaskie/?nrAdsPerPage=72&search[order]=filter_float_price%3Adesc> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

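The missing requests are treated as duplicates, so they are skipped. Note that only the first filtered duplicate is logged unless DUPEFILTER_DEBUG is enabled, as the message itself says. For more information, see the documentation for the DUPEFILTER_CLASS setting.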
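To check whether duplicates could account for the gap between 115 and 38, one option is to count the distinct URLs in the input file before crawling. The following is a minimal sketch using only the standard library; `count_urls` is a hypothetical helper, and the sample records are made up for illustration. It compares raw URL strings, whereas Scrapy's filter compares request fingerprints, so treat the result as an approximation.

```python
import json
from collections import Counter


def count_urls(lines, key="URL"):
    """Return (total, unique, repeated) for JSON-lines records.

    Each non-empty line is expected to be a JSON object holding the URL
    under `key`, as in the urls.jl file read by the spider.
    """
    urls = [json.loads(line)[key] for line in lines if line.strip()]
    counts = Counter(urls)
    # Keep only URLs that occur more than once, with their counts.
    repeated = {url: n for url, n in counts.items() if n > 1}
    return len(urls), len(counts), repeated


# Made-up sample data; in practice, pass the open urls.jl file object.
sample = [
    '{"URL": "https://www.example.com/a"}',
    '{"URL": "https://www.example.com/b"}',
    '{"URL": "https://www.example.com/a"}',
]
total, unique, repeated = count_urls(sample)
```

If the repeated URLs do need to be fetched again, Request accepts a dont_filter=True argument that bypasses the duplicate filter for that request; otherwise deduplicating the file up front keeps the batch counter honest.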