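If the spider gets a redirect, it should make the request again, but with different parameters. The callback of the second Request is never executed.
If I use different urls in the start and checker methods, it works fine. I think the requests use lazy loading and that is why my code does not work, but I am not sure.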
from scrapy.http import Request
from scrapy.spider import BaseSpider


class TestSpider(BaseSpider):

    def start(self, response):
        return Request(url='http://localhost/', callback=self.checker, meta={'dont_redirect': True})

    def checker(self, response):
        if response.status == 301:
            return Request(url="http://localhost/", callback=self.results, meta={'dont_merge_cookies': True})
        else:
            return self.results(response)

    def results(self, response):
        # here I work with response
        pass
Answer 0 (score: 3)
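Not sure if you still need this, but I have put together an example. If you have a specific website in mind, we can definitely take a look at it.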
from scrapy.http import Request
from scrapy.spider import BaseSpider


class TestSpider(BaseSpider):
    name = "TEST"
    allowed_domains = ["example.com", "example.iana.org"]

    def __init__(self, **kwargs):
        super(TestSpider, self).__init__(**kwargs)
        self.url = "http://www.example.com"
        self.max_loop = 3
        self.loop = 0  # We want it to loop 3 times, so keep a counter on the spider

    def start_requests(self):
        # I'll write it out more explicitly here
        print "OPEN"
        checkRequest = Request(
            url=self.url,
            meta={"test": "first"},
            callback=self.checker
        )
        return [checkRequest]

    def checker(self, response):
        # I wasn't sure about a specific website that gives 302
        # so I just used 200. We need the loop counter or it will keep going
        if self.loop < self.max_loop and response.status == 200:
            print "RELOOPING", response.status, self.loop, response.meta['test']
            self.loop += 1
            checkRequest = Request(
                url=self.url,
                callback=self.checker
            ).replace(meta={"test": "not first"})
            return [checkRequest]
        else:
            print "END LOOPING"
            self.results(response)  # No need to return, just call method

    def results(self, response):
        print "DONE"  # Do stuff here
In settings.py, set this option:
DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'
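This effectively turns off the filter for duplicate requests. It is confusing, because despite its name BaseDupeFilter is not the default filter: it does not actually filter anything. With it in place, the spider submits three more requests for the same URL, and each one passes through the checker method again. Also, I am using scrapy 0.16: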
>scrapy crawl TEST
>OPEN
>RELOOPING 200 0 first
>RELOOPING 200 1 not first
>RELOOPING 200 2 not first
>END LOOPING
>DONE
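To make it clearer why the setting matters, here is a minimal sketch of how a duplicate filter behaves. It is a toy model in plain Python, not Scrapy's implementation: SeenUrlFilter and schedule are hypothetical names made up for this illustration, and the real filter compares request fingerprints rather than raw URL strings. The point is only that a second request for an already seen URL is dropped before it is downloaded, so its callback never runs. That would explain the behaviour in the question, where both requests go to http://localhost/ and differ only in meta.

```python
class SeenUrlFilter:
    """Toy stand-in for a duplicate filter: remembers URLs already seen."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        # Return True if this URL was scheduled before, otherwise record it.
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


def schedule(urls, dupefilter, dont_filter=False):
    """Return the URLs that would actually be downloaded, in order."""
    accepted = []
    for url in urls:
        # With dont_filter set, the filter is not consulted at all.
        if dont_filter or not dupefilter.request_seen(url):
            accepted.append(url)
    return accepted


same_url = ["http://localhost/", "http://localhost/"]
filtered = schedule(same_url, SeenUrlFilter())
unfiltered = schedule(same_url, SeenUrlFilter(), dont_filter=True)
print(filtered)    # only the first request survives
print(unfiltered)  # both requests survive
```

If you would rather not disable the filter for the whole project, Scrapy's Request also accepts dont_filter=True, which exempts just that one request from duplicate filtering.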