How to chain callbacks across two sequential requests in Scrapy

Date: 2013-05-16 14:25:14

Tags: python scrapy

If the spider gets redirected, it should make the request again with different parameters. The callback of that second request is never executed.

If I use a different checker with start_urls, it works fine. I suspect the requests are lazily loaded and that this is why my code doesn't work, but I'm not sure.

from scrapy.http import Request
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):

    def start(self, response):
        return Request(url = 'http://localhost/', callback=self.checker, meta={'dont_redirect': True})

    def checker(self, response):
        if response.status == 301:
            return Request(url = "http://localhost/", callback=self.results, meta={'dont_merge_cookies': True})
        else:
            return self.results(response)

    def results(self, response):
        # here I work with the response
        pass

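Scrapy aside, the intended control flow can be simulated with a toy engine loop: the engine passes each response to the request's callback and schedules any `Request` the callback returns. The `Request`, `Response`, `fetch`, and `run` names below are simplified stand-ins for illustration, not Scrapy's real classes:

```python
from collections import deque

class Request:
    """Minimal stand-in for scrapy.http.Request (illustrative only)."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class Response:
    """Minimal stand-in for a Scrapy response."""
    def __init__(self, url, status, meta):
        self.url = url
        self.status = status
        self.meta = meta

def run(first_request, fetch):
    """Toy engine loop: fetch each request, pass the response to its
    callback, and schedule any Request the callback returns."""
    queue = deque([first_request])
    while queue:
        request = queue.popleft()
        response = fetch(request)
        result = request.callback(response)
        if isinstance(result, Request):
            queue.append(result)

# Fake downloader: the first hit returns 301, any retry returns 200.
seen = []
def fetch(request):
    status = 301 if not seen else 200
    seen.append(request.url)
    return Response(request.url, status, request.meta)

visited = []
def checker(response):
    if response.status == 301:
        return Request(response.url, callback=results,
                       meta={'dont_merge_cookies': True})
    return results(response)

def results(response):
    visited.append(response.status)

run(Request('http://localhost/', callback=checker,
            meta={'dont_redirect': True}), fetch)
print(visited)  # -> [200]
```

In this toy loop the second callback does run, which shows the question's logic is sound in principle; the difference in real Scrapy is the scheduler's duplicate-request filtering, which the answer below addresses.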
1 Answer:

Answer 0 (score: 3)

Not sure if you still need this, but I've put together an example. If you have a specific website in mind, we can certainly take a look at it.

from scrapy.http import Request
from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):

    name = "TEST"
    allowed_domains = ["example.com", "example.iana.org"]

    def __init__(self, **kwargs):
        super(TestSpider, self).__init__(**kwargs)
        self.url      = "http://www.example.com"
        self.max_loop = 3
        self.loop     = 0  # We want it to loop 3 times so keep a class var

    def start_requests(self):
        # I'll write it out more explicitly here
        print "OPEN"                       
        checkRequest = Request( 
            url      = self.url, 
            meta     = {"test":"first"},
            callback = self.checker 
        )
        return [ checkRequest ]

    def checker(self, response):
        # I wasn't sure about a specific website that gives 302 
        # so I just used 200. We need the loop counter or it will keep going

        if self.loop < self.max_loop and response.status == 200:
            print "RELOOPING", response.status, self.loop, response.meta['test']
            self.loop += 1

            checkRequest = Request(
                url = self.url,
                callback = self.checker
            ).replace(meta = {"test":"not first"})
            return [checkRequest]
        else:
            print "END LOOPING"
            self.results(response) # No need to return, just call method

    def results(self, response):
        print "DONE"  # Do stuff here

In settings.py, set this option:

DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'

This effectively turns off the filtering of duplicate requests to the same site. It's confusing, because BaseDupeFilter is not actually the default, since it doesn't really filter anything. This means all three requests will be submitted and will loop through the checker method. Also, I'm using scrapy 0.16:

>scrapy crawl TEST
>OPEN
>RELOOPING 200 0 first
>RELOOPING 200 1 not first
>RELOOPING 200 2 not first
>END LOOPING
>DONE
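An alternative to swapping DUPEFILTER_CLASS globally is to pass `dont_filter=True` on the repeated `Request`, which exempts just that request from the filter. What the filter being disabled actually does can be sketched in plain Python; the `SimpleDupeFilter` class below is a hypothetical simplification, not Scrapy's implementation (the real filter fingerprints method, canonical URL, and body rather than comparing raw URLs):

```python
class SimpleDupeFilter:
    """Rough sketch of a duplicate-request filter: remember each URL
    and drop repeats. (Scrapy's real filter hashes method + canonical
    URL + body; a plain URL set is used here for brevity.)"""
    def __init__(self):
        self.seen = set()

    def request_allowed(self, url, dont_filter=False):
        # dont_filter=True on a Request bypasses the filter entirely,
        # which is the usual alternative to changing DUPEFILTER_CLASS.
        if dont_filter:
            return True
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

f = SimpleDupeFilter()
print(f.request_allowed("http://www.example.com"))  # -> True (first time)
print(f.request_allowed("http://www.example.com"))  # -> False (duplicate dropped)
print(f.request_allowed("http://www.example.com", dont_filter=True))  # -> True
```

This is why the looping spider above needs either the settings change or `dont_filter=True`: without one of them, the second identical request is silently dropped and its callback never fires.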