I have the following URL: http://somedomain.mytestsite.com/?offset=0. I want to iterate over this URL by incrementing the offset parameter, say by 100 each time. Every time I receive a response, I need to check some condition to decide whether the next iteration should run. For example:
import json

import scrapy
from scrapy.spider import BaseSpider


class SomeSpider(BaseSpider):
    name = 'somespider'
    offset = 0
    items = list()

    def start_requests(self):
        return [scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset),
                               callback=self.request_iterator)]

    def request_iterator(self, response):
        body = response.body
        # let's say we get json as response data
        data = json.loads(body)
        # check if the page still has data to process
        if data["matches"]:
            self.items.extend(data["matches"])
            self.offset += 100
            return self.start_requests()
        else:
            # process collected data in items list
            return self.do_something_with_items()
This works, but I can't help feeling there is something wrong with this code. Maybe I should be using some Scrapy rules?
Answer 0 (score: 1)
The following things can be improved:
1) Don't keep the items as a spider attribute: with larger inputs you will consume an extremely large amount of memory. Use Python generators instead. With generators you can yield both items and requests from a single spider callback without any trouble.
2) start_requests is used at spider startup, and there seems to be no need to override it in your code. If you rename your method to parse (the default method name used as the callback for the requests generated by start_requests), the code becomes more readable:
# we should process at least one item otherwise data["matches"] will be empty
start_urls = ["http://somedomain.mytestsite.com/?offset=" + str(1)]

def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if data["matches"]:
        for x in data["matches"]:
            yield self.process_your_item(x)
        self.offset += 100
        yield self.next_request()
    else:
        # process collected data in items list
        for x in self.do_something_with_items():
            yield x

def next_request(self):
    return scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset))
A possibly better version of the callback would be:
def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if not data["matches"]:
        self.logger.info("processing done")
        return

    for x in data["matches"]:
        yield self.process_your_item(x)
    self.offset += 100
    yield self.next_request()
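For reference, here is a minimal, self-contained sketch of what the whole spider might look like once both suggestions are applied. It is only a sketch under the question's assumptions: the endpoint is assumed to return JSON with a "matches" list, an empty list is assumed to mean there are no more pages, the step of 100 is taken from the question, process_your_item is a hypothetical helper standing in for your real item processing, and scrapy.Spider is used in place of the older BaseSpider from the question:

import json

import scrapy


class SomeSpider(scrapy.Spider):
    name = "somespider"
    offset = 0    # incremented after every non-empty page
    step = 100    # page size assumed from the question

    start_urls = ["http://somedomain.mytestsite.com/?offset=0"]

    def parse(self, response):
        data = json.loads(response.body)

        # an empty "matches" list is assumed to mean the last page was reached
        if not data["matches"]:
            self.logger.info("processing done")
            return

        # yield items one by one instead of collecting them on the spider,
        # so memory usage stays flat no matter how many pages there are
        for match in data["matches"]:
            yield self.process_your_item(match)

        # schedule the next page; the default callback is parse again
        self.offset += self.step
        yield scrapy.Request(
            "http://somedomain.mytestsite.com/?offset=" + str(self.offset)
        )

    def process_your_item(self, match):
        # hypothetical helper: turn one raw "match" into an item dict
        return {"match": match}

Saved in a standalone file, a spider like this can be tried with scrapy runspider yourfile.py (hypothetical filename), or with scrapy crawl somespider inside a Scrapy project.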