I have the following URL: http://somedomain.mytestsite.com/?offset=0. I want to iterate over this URL by incrementing the offset parameter, say by 100 each time. Every time I receive a response, I need to check some condition to decide whether the next iteration should run. For example:
import json

import scrapy
from scrapy.spider import BaseSpider


class SomeSpider(BaseSpider):
    name = 'somespider'
    offset = 0
    items = list()

    def start_requests(self):
        return [scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset),
                               callback=self.request_iterator)]

    def request_iterator(self, response):
        body = response.body
        # let's say we get json as response data
        data = json.loads(body)
        # check if the page still has data to process
        if data["matches"]:
            self.items.extend(data["matches"])
            self.offset += 100
            return self.start_requests()
        else:
            # process collected data in items list
            return self.do_something_with_items()
This works, but I can't help feeling there is something wrong with this code. Maybe I should be using some Scrapy rules?
Answer 0 (score: 1)
The following things can be improved:
1) Don't keep the items as a spider attribute: with larger inputs you will consume an extremely large amount of memory. Use Python generators instead. With generators you can yield both items and requests from a single spider callback without any trouble.
2) start_requests is used at spider startup, and there seems to be no need to override it in your code. If you rename your method to parse (the default method name used as the callback for the requests generated by start_requests), the code becomes more readable:
# we should process at least one item otherwise data["matches"] will be empty
start_urls = ["http://somedomain.mytestsite.com/?offset=" + str(1)]

def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if data["matches"]:
        for x in data["matches"]:
            yield self.process_your_item(x)
        self.offset += 100
        yield self.next_request()
    else:
        # process collected data in items list
        for x in self.do_something_with_items():
            yield x

def next_request(self):
    return scrapy.Request("http://somedomain.mytestsite.com/?offset=" + str(self.offset))
A possibly better version of the callback would be:
def parse(self, response):
    body = response.body
    # let's say we get json as response data
    data = json.loads(body)
    # check if the page still has data to process
    if not data["matches"]:
        self.logger.info("processing done")
        return

    for x in data["matches"]:
        yield self.process_your_item(x)
    self.offset += 100
    yield self.next_request()
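For reference, here is a minimal, self-contained sketch of what the whole spider might look like once both suggestions are applied. It is only a sketch under the question's assumptions: the endpoint is assumed to return JSON with a "matches" list, an empty list is assumed to mean there are no more pages, the step of 100 is taken from the question, process_your_item is a hypothetical helper standing in for your real item processing, and scrapy.Spider is used in place of the older BaseSpider from the question:

import json

import scrapy


class SomeSpider(scrapy.Spider):
    name = "somespider"
    offset = 0    # incremented after every non-empty page
    step = 100    # page size assumed from the question

    start_urls = ["http://somedomain.mytestsite.com/?offset=0"]

    def parse(self, response):
        data = json.loads(response.body)

        # an empty "matches" list is assumed to mean the last page was reached
        if not data["matches"]:
            self.logger.info("processing done")
            return

        # yield items one by one instead of collecting them on the spider,
        # so memory usage stays flat no matter how many pages there are
        for match in data["matches"]:
            yield self.process_your_item(match)

        # schedule the next page; the default callback is parse again
        self.offset += self.step
        yield scrapy.Request(
            "http://somedomain.mytestsite.com/?offset=" + str(self.offset)
        )

    def process_your_item(self, match):
        # hypothetical helper: turn one raw "match" into an item dict
        return {"match": match}

Saved in a standalone file, a spider like this can be tried with scrapy runspider yourfile.py (hypothetical filename), or with scrapy crawl somespider inside a Scrapy project.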