Fewer responses processed than successfully crawled

Time: 2018-08-08 10:47:17

Tags: python scrapy scrapy-spider

The scraper has two issues:

  1. Although I use 'COOKIES_ENABLED': False and rotating proxies, which should give every request a different IP, after a while I still get 302 responses. I worked around it by restarting the scraper after it hits several 302s in a row.

  2. The scraper successfully crawls far more responses than it processes, and I cannot do anything about it. In the example below I got 121 responses, but only 27 were processed.
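As an alternative to restarting the scraper on 302s, one could let Scrapy's built-in RetryMiddleware re-request them; with a proxy that rotates per request, each retry should go out from a different IP. `RETRY_TIMES` and `RETRY_HTTP_CODES` are real Scrapy settings, but the values below are only an illustrative sketch, not the asker's configuration:

```python
# Sketch: retry 302s through the rotating proxy instead of restarting.
# The specific values here are illustrative assumptions.
custom_settings = {
    'COOKIES_ENABLED': False,
    'RETRY_TIMES': 5,            # re-request a 302 up to 5 times
    'RETRY_HTTP_CODES': [302],   # treat 302 as retryable
}

# Keeping 302 in handle_httpstatus_list stops RedirectMiddleware from
# following the redirect before RetryMiddleware gets to see it.
handle_httpstatus_list = [301, 302]
```

If all retries are exhausted, the 302 response still reaches `parse()`, since it remains in `handle_httpstatus_list`.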

Spider

from scrapy import Spider, Request
from scrapy.exceptions import CloseSpider


class MySpider(Spider):
    name = 'MySpider'
    custom_settings = {
        'DOWNLOAD_DELAY': 0,
        'RETRY_TIMES': 1,
        'LOG_LEVEL': 'DEBUG',
        'CLOSESPIDER_ERRORCOUNT': 3,
        'COOKIES_ENABLED': False,
    }
    # I need to manually control when the spider stops, otherwise it runs forever
    handle_httpstatus_list = [301, 302]

    def start_requests(self):
        for row in self.df.itertuples():
            yield Request(
                url=row.link,
                callback=self.parse,
                priority=100
            )

    def close(self, reason):
        self.logger.info('TOTAL ADDED: %s' % self.added)

    def parse(self, r):
        if r.status == 302:
            # I need to manually control when the spider stops, otherwise it runs forever
            raise CloseSpider("")
        else:
            # do parsing stuff
            self.added += 1
            self.logger.info('{} left'.format(len(self.df[self.df['status'] == 0])))

Output

2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url1> (referer: None)
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url2> (referer: None)
2018-08-08 12:24:24 [MySpider] INFO: 52451 left
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url3> (referer: None)
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url4> (referer: None)
2018-08-08 12:24:24 [MySpider] INFO: 52450 left
2018-08-08 12:24:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://mytarget.com/url4> (referer: None)


2018-08-08 12:24:37 [MySpider] INFO: TOTAL ADDED: 27
2018-08-08 12:24:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
...
...
 'downloader/response_status_count/200': 121,
 'downloader/response_status_count/302': 4,

It successfully crawls far more than it processes (3x to 4x more). How can I force it to process everything it has crawled?

I can sacrifice speed, but I don't want to waste the successful 200 crawls.

1 Answer:

Answer 0: (score: 1)

When you raise CloseSpider() on a 302, the scheduler has probably not yet passed all of the crawled 200 responses to your parse() method.

Log and ignore the 302s instead, and let the spider finish on its own.
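A minimal sketch of that advice, using a plain class so the logic is easy to show in isolation (the `IgnoreRedirects` name and the counters are hypothetical, not the asker's actual spider): log the 302 and return, so the already-crawled 200 responses keep flowing into `parse()` until the scheduler drains.

```python
import logging


class IgnoreRedirects:
    """Hypothetical stand-in for the spider's parse() logic:
    skip 302s instead of raising CloseSpider."""

    def __init__(self):
        self.added = 0    # 200 responses actually processed
        self.ignored = 0  # 302 responses logged and skipped
        self.logger = logging.getLogger('MySpider')

    def parse(self, response):
        if response.status == 302:
            # Log and ignore: raising CloseSpider here would drop
            # every 200 response still queued in the scheduler.
            self.ignored += 1
            self.logger.info('ignoring 302 from %s', response.url)
            return
        # do parsing stuff for the 200s
        self.added += 1
```

With this shape, the spider closes only once the scheduler is empty, so the processed count should match the `response_status_count/200` figure in the final stats.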