Question

I have this code, and I need to follow all the pagination links in my function parse_additional_info.

from scrapy import Request, Selector

start_urls = ['http://example.com']

def parse_start_url(self, response):
    sel = Selector(response)
    aa = sel.xpath('//h3/a...../@href').extract()
    for a in aa:
        yield Request(url=a, callback=self.parse_additional_info)

def parse_additional_info(self, response):
    sel = Selector(response)
    # href of every anchor whose text contains 'Next'
    nextPageLinks = sel.xpath("//a[contains(text(), 'Next')]/@href").extract()

Note: I have already tried Scrapy rules, but they did not work here because the spider uses a chain of callbacks.

Answer 1

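I found the answer myself. I had to call the response object's urljoin method on the nextPageLinks URL and yield a request back to the same function until there are no pages left. The code below may help someone in the same situation.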

def parse_additional_info(self, response):
    .
    .

    if nextPageLinks:
        url = response.urljoin(nextPageLinks[0])
        yield Request(url=url, callback=self.parse_additional_info)
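
To see what the urljoin step and the self-referencing callback contribute, here is a minimal sketch that runs without Scrapy. It uses urllib.parse.urljoin from the standard library, which resolves a relative href against a page URL in the same way response.urljoin resolves it against the response URL. The FAKE_PAGES dict, the page URLs, and the crawl helper are all hypothetical stand-ins for a real site and for the Scrapy engine; they are not part of the original spider.

```python
from urllib.parse import urljoin

# Hypothetical site: each page URL maps to the href of its "Next" link
# (relative, as it would appear in the HTML), or None on the last page.
FAKE_PAGES = {
    "http://example.com/items/page1.html": "page2.html",
    "http://example.com/items/page2.html": "/items/page3.html",
    "http://example.com/items/page3.html": None,
}


def parse_additional_info(page_url):
    """Yield the absolute URL of the next page, if there is one."""
    next_href = FAKE_PAGES[page_url]
    if next_href:
        # Plays the role of response.urljoin(nextPageLinks[0]) in the spider.
        yield urljoin(page_url, next_href)


def crawl(start_url):
    """Stand-in for the Scrapy engine: feed each yielded URL to the same callback."""
    visited = []
    pending = [start_url]
    while pending:
        url = pending.pop()
        visited.append(url)
        pending.extend(parse_additional_info(url))
    return visited


visited = crawl("http://example.com/items/page1.html")
for url in visited:
    print(url)
```

Without the urljoin call, the second request would be made to the bare string "page2.html", which is not a valid absolute URL. The loop stops on its own once a page has no Next link, which mirrors the "if nextPageLinks:" check above.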

Scrapy: following pagination in a second-level callback
