Iterating over url params template in Scrapy

Date: 2015-12-22 11:55:16

Tags: python scrapy scrapy-spider

I have the following URL: http://somedomain.mytestsite.com/?offset=0. I want to iterate over this URL by incrementing the offset parameter, let's say by 100 each time. Each time I receive a response, I need to check some condition to decide whether to run the next iteration. For example:

import json

import scrapy
from scrapy.spider import BaseSpider


class SomeSpider(BaseSpider):
    name = 'somespider'

    offset = 0
    items = list()

    def start_requests(self):
        return [scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset),
                               callback=self.request_iterator)]

    def request_iterator(self, response):
        body = response.body
        # let's say we get json as response data
        data = json.loads(body)
        # check if the page still has data to process
        if data["matches"]:
            self.items.extend(data["matches"])
            self.offset += 100
            return self.start_requests()
        else:
            # process the collected data in the items list
            return self.do_something_with_items()

This works, but I can't help feeling something is off about this code. Maybe I should be using some scrapy rules?

1 Answer:

Answer 0 (score: 1)

The following things could be improved:

1) Don't keep items as a spider attribute: with bigger inputs you will consume an extremely large amount of memory. Use Python generators instead; with a generator you can yield items and requests from a single spider callback without any trouble.

2) start_requests is what runs at spider startup, and there seems to be no need to override it in your code. If you rename your method to parse (the default method name used as the callback for start_requests), the code becomes more readable:

# we should process at least one item, otherwise data["matches"] will be empty
offset = 1
start_urls = ["http://somedomain.mytestsite.com/?offset=" + str(offset)]

def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if data["matches"]:
        for x in data["matches"]:
            yield self.process_your_item(x)
        self.offset += 100
        yield self.next_request()
    else:
        # process the collected data
        for x in self.do_something_with_items():
            yield x

def next_request(self):
    return scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset))

A probably better version of the callback would be:

def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if not data["matches"]:
        self.logger.info("processing done")
        return
    for x in data["matches"]:
        yield self.process_your_item(x)
    self.offset += 100
    yield self.next_request()
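
One more note: mutating self.offset works here only because the requests form a single sequential chain; if several offsets were ever scheduled in parallel, the shared attribute would become a race condition. Below is a minimal sketch (not from the original answer) of the same loop that carries the offset inside each request via request.meta instead of spider state; the spider name and the body of process_your_item are hypothetical placeholders:

import json

import scrapy


class OffsetSpider(scrapy.Spider):
    # hypothetical spider name, same example URL as above
    name = 'offsetspider'
    start_urls = ["http://somedomain.mytestsite.com/?offset=0"]

    def parse(self, response):
        data = json.loads(response.text)
        if not data["matches"]:
            self.logger.info("processing done")
            return
        for x in data["matches"]:
            yield self.process_your_item(x)
        # read the current offset from the request that produced this
        # response instead of from mutable spider state
        offset = response.meta.get("offset", 0) + 100
        yield scrapy.Request(
            "http://somedomain.mytestsite.com/?offset=" + str(offset),
            meta={"offset": offset},
        )

    def process_your_item(self, x):
        # placeholder: turn one match into an item dict
        return {"match": x}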