Scrapy: can't restart start_requests() correctly

Asked: 2019-01-10 09:30:36

Tags: python python-3.x scrapy generator scrapy-spider

I have a scraper that starts from two pages: one is the site's main page, and the other is a .js file containing the latitude/longitude coordinates I need to extract, because they are required later in the parsing process. I want to process the .js file first, extract the coordinates, then parse the main page and start crawling its links / parsing its items. To do this I use the priority parameter of Request and specify that I want the .js page processed first. This works, but only about 70% of the time (presumably because of Scrapy's asynchronous requests). The remaining 30% of the time my parse method ends up trying to extract the lat/long coordinates from the .js file but has been handed the site's main page instead, so it cannot extract them.

For that reason I tried to fix it as follows: in the parse() method, check which URL arrived, and if the first response is not the .js file, restart the spider. However, when the spider restarts it now correctly processes the .js file first, but after processing it the spider finishes its work and the script exits without any error, as if it were done. Why does this happen, and what is different about how the pages are processed when the spider is restarted compared with its first launch?

Below is the code, with sample output for both cases, from my attempts to debug what gets executed and why everything stops after the restart.

import re
from scrapy import Spider, Request

class QuotesSpider(Spider):

    name = "bot"
    url_id = 0
    home_url = 'https://website.com'
    longitude = None
    latitude = None

    def __init__(self, cat=None):
        # Guard against the default None to avoid an AttributeError.
        self.cat = cat.replace("-", " ") if cat else cat

    def start_requests(self):
        print ("Starting spider")
        self.start_urls = [
             self.home_url,
             self.home_url+'js-file-with-long-lat.js'
        ]
        for priority, url in enumerate(self.start_urls):
            print ("Processing", url)
            yield Request(url=url, priority=priority, callback=self.parse)


    def parse(self, response):
        print ("Inside parse")
        if self.url_id == 0 and response.url == self.home_url:
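            # self.alert() is a custom helper (not shown) that logs the message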
            self.alert("Loaded main page before long/lat page, restarting", False)
            for _ in self.start_requests():
                yield _
        else:
            print ("Everything is good, url id is", str(self.url_id))
            self.url_id += 1
            if self.longitude is None:
                for _ in self.parse_long_lat(response):
                    yield _
            else:
                print ("Calling parse cats")
                for cat in self.parse_cats(response):
                    yield cat

    def parse_long_lat(self, response):
        print ("called long lat")
        try:
            self.latitude = re.search(r'latitude:(\-?[0-9]{1,2}\.?[0-9]*)',
                                      response.text).group(1)
            self.longitude = re.search(r'longitude:(\-?[0-9]{1,3}\.?[0-9]*)',
                                       response.text).group(1)
            print ("Extracted coords")
            yield None
        except AttributeError as e:
            self.alert("\nCan't extract lat/long coordinates, store availability will not be parsed. ", False)
            yield None

    def parse_cats(self, response):           
        pass
        """ Parsing links code goes here """

Output when the spider starts correctly, fetching the .js page first and only then starting to parse the cats:

Starting spider
Processing https://website.com
Processing https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords
Inside parse
Everything is good, url id is 1
Calling parse cats

After that the script keeps running and everything parses fine. Output when the spider starts incorrectly, hits the main page first, and restarts start_requests():

Starting spider
Processing https://website.com
Processing https://website.com/js-file-with-long-lat.js
Inside parse
Loaded main page before long/lat page, restarting
Starting spider
Processing https://website.com
Processing https://website.com/js-file-with-long-lat.js
Inside parse
Everything is good, url id is 0
called long lat
Extracted coords

Here the script stops executing without any errors, as if it had finished.

P.S. In case it matters, I did notice that the URLs in start_requests() are processed in reverse order, but that seemed natural to me given the loop sequence, and I expected the priority parameter to do its job (as it does most of the time, and as it should according to Scrapy's documentation).

2 Answers:

Answer 0 (score: 1)

As to why your spider does not continue in the "restart" case: you are running into duplicate requests being filtered out and dropped. Since the pages have already been visited, Scrapy considers them done.
So you have to re-send those requests with the dont_filter=True parameter:

for priority, url in enumerate(self.start_urls):
    print ("Processing", url)
    yield Request(url=url, dont_filter=True, priority=priority, callback=self.parse)
    #                      ^^^^^^^^^^^^^^^^  notice us forcing the Dupefilter to
    #                                        ignore duplicate requests to these pages
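The flag matters precisely on the restart path: when parse() re-yields the requests coming out of start_requests(), Scrapy's default dupefilter (RFPDupeFilter) silently drops them as already seen, the scheduler runs empty, and the spider closes as "finished", which is exactly the symptom in the question.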

As for a better solution than this clunky restart approach, consider using InitSpider (other approaches exist as well). It guarantees that your "initial" work is done before the real crawl starts and can be relied on.
(For some reason this class was never documented in the Scrapy docs, but it is a relatively simple Spider subclass: it does some initialization work before starting the actual run.)
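
Internally it is only a few lines. Roughly, paraphrased from the Scrapy source (so check it against your installed version), it stashes the normal start requests and releases them only when initialized() is called:

from scrapy import Spider
from scrapy.utils.spider import iterate_spider_output

class InitSpider(Spider):

    def start_requests(self):
        # Hold back the usual requests built from start_urls and run
        # the initialization request(s) first.
        self._postinit_reqs = super().start_requests()
        return iterate_spider_output(self.init_request())

    def initialized(self, response=None):
        # Used as the callback of the last init request: releases the
        # stashed start_urls requests so the real crawl can begin.
        return self.__dict__.pop('_postinit_reqs')

    def init_request(self):
        # Default is a no-op; subclasses override this to do real work.
        return self.initialized()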

Here is a usage example:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.init import InitSpider

class QuotesSpider(InitSpider):
    name = 'quotes'
    allowed_domains = ['website.com']
    start_urls = ['https://website.com']

    # Without this method override, InitSpider behaves like Spider.
    # This is used _instead of_ start_requests. (Do not override start_requests.)
    def init_request(self):
        # The last request that finishes the initialization needs
        # to have the `self.initialized()` method as callback.
        url = self.start_urls[0] + '/js-file-with-long-lat.js'
        yield scrapy.Request(url, callback=self.parse_long_lat, dont_filter=True)

    def parse_long_lat(self, response):
        """ The callback for our init request. """
        print ("called long lat")

        # do some work and maybe return stuff
        self.latitude = None
        self.longitude = None
        #yield stuff_here

        # Finally, start our run.
        return self.initialized()
        # Now we are "initialized", will process `start_urls`
        # and continue from there.

    def parse(self, response):
        print ("Inside parse")
        print ("Everything is good, do parse_cats stuff here")

This produces output like the following:

2019-01-10 20:36:20 [scrapy.core.engine] INFO: Spider opened
2019-01-10 20:36:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1/js-file-with-long-lat.js> (referer: None)
called long lat
2019-01-10 20:36:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://127.0.0.1> (referer: http://127.0.0.1/js-file-with-long-lat.js/)
Inside parse
Everything is good, do parse_cats stuff here
2019-01-10 20:36:21 [scrapy.core.engine] INFO: Closing spider (finished)
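
Note the ordering in the crawl log: the .js file is fetched and parse_long_lat runs before the start_urls request is even sent (its referer is the init request), which is exactly the guarantee the priority approach could not provide.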

Answer 1 (score: 0)

So I finally handled it with a workaround: I check which response.url was received in parse() and, based on that, dispatch further parsing to the corresponding method:

def start_requests(self):
    self.start_urls = [
        self.home_url,
        self.home_url + 'js-file-with-long-lat.js'
    ]
    for priority, url in enumerate(self.start_urls):
        yield Request(url=url, priority=priority, callback=self.parse)

def parse(self, response):
    if response.url != self.home_url:
        for _ in self.parse_long_lat(response):
            yield _
    else:
        for cat in self.parse_cats(response):
            yield cat
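
An even simpler way to guarantee the ordering, with no priorities, restarts, or URL checks, is to chain the requests: request only the .js file up front and yield the main-page request from its callback once the coordinates are stored. A minimal sketch, reusing home_url, parse_long_lat, and parse_cats from the question:

import re
from scrapy import Spider, Request

class QuotesSpider(Spider):
    name = "bot"
    home_url = 'https://website.com'

    def start_requests(self):
        # Only the .js file is requested here; the main page is requested
        # from its callback, so the coordinates are always extracted first.
        yield Request(self.home_url + 'js-file-with-long-lat.js',
                      callback=self.parse_long_lat)

    def parse_long_lat(self, response):
        self.latitude = re.search(r'latitude:(\-?[0-9]{1,2}\.?[0-9]*)',
                                  response.text).group(1)
        self.longitude = re.search(r'longitude:(\-?[0-9]{1,3}\.?[0-9]*)',
                                   response.text).group(1)
        # Coordinates stored; now move on to the main page.
        yield Request(self.home_url, callback=self.parse_cats)

    def parse_cats(self, response):
        """ Parsing links code goes here """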