Question

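I've recently been working on a website spider and noticed that it was requesting an unlimited number of pages, because one site had not coded its pagination to ever stop.

So although the site only had a few pages of content, it would still generate a next link and a URL: ...?page=400, ...?page=401, and so on.

The content didn't change, just the URL. Is there a way to make Scrapy stop following pagination when the content stops changing? Or is there something custom I could write?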

Answer 1

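If the content doesn't change, you can compare the content of the current page with that of the previous page and stop crawling if they are the same.

For example: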

import logging
import re

from scrapy import Request

# parse() is a method of your Spider subclass
def parse(self, response):
    product_urls = response.xpath("//a/@href").extract()
    # check last page: same links as the previous page means nothing new
    if response.meta.get('prev_urls') == product_urls:
        logging.info('reached the last page at: {}'.format(response.url))
        return  # reached the last page
    # crawl products (urljoin makes relative hrefs absolute)
    for url in product_urls:
        yield Request(response.urljoin(url), self.parse_product)
    # create next page url
    next_page = response.meta.get('page', 0) + 1
    next_url = re.sub(r'page=\d+', 'page={}'.format(next_page), response.url)
    # now for the next page carry some data in meta
    yield Request(next_url,
                  meta={'prev_urls': product_urls,
                        'page': next_page})
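
A variation on the same idea, as a rough sketch: instead of carrying the whole list of links in meta, you could carry a short fingerprint of it and compare fingerprints. The helper names `links_fingerprint` and `is_repeated_page` are made up for this illustration and are not part of Scrapy.

```python
import hashlib


def links_fingerprint(urls):
    """Return a SHA-1 hex digest for a list of extracted URLs (order matters)."""
    joined = "\n".join(urls)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def is_repeated_page(prev_fingerprint, urls):
    """True when this page's links hash to the previous page's fingerprint."""
    # No previous fingerprint means this is the first page, so never a repeat.
    return prev_fingerprint is not None and prev_fingerprint == links_fingerprint(urls)
```

Inside parse you would then return early when is_repeated_page(response.meta.get('prev_fp'), product_urls) is true, and pass meta={'prev_fp': links_fingerprint(product_urls), 'page': next_page} to the next request. The meta key 'prev_fp' is likewise just an assumed name.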

Scrapy - How to avoid a pagination black hole?
