Question

I am trying out Scrapy on the link below.

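With the given filter there are only 3 result pages. On the last page the Next link is no longer active, yet Scrapy keeps scraping. It carries on even though I check whether a next link is available. I can't figure out what the problem is. My output also contains data.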

The next link active in first page

The next link inactive in the last page

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "worldcat"        
    start_urls = [
        'https://www.worldcat.org/search?q=computer+science&qt=results_page#%2528x0%253Aaudiobook%2Bx4%253Alp%2529format',
    ]

    def parse(self, response):
        for book in response.css('.menuElem'):
            yield {
                'title': book.css('.details .name a strong::text').get(),
                'author': book.css('.details .author::text').get(),
                'publisher': book.css('.details .publisher .itemPublisher::text').get(),
            }

        next_page = response.xpath('/html//td[@align = "right"]/a[.="Next"]/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

I also tried adding next_page = "" at the end of the if next_page block, but that did not help.

However, for some reason it does work on this link.

Answer 1

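Scrapy sees the page source as it would be served with JavaScript disabled. Since the page you are scraping uses AJAX pagination, it may behave differently when JS is disabled.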

When you disable JS and open the page you want to scrape, it reports 6,339,303 results, which is why the scraping appears to go on forever.
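One likely contributor, assuming the format filter is the part after # in start_urls: a URL fragment is never sent in an HTTP request, so the server only receives the plain search query and the filter has to be applied by JavaScript in the browser. The sketch below uses only the standard library to show what the server would receive; the helper names url_sent_to_server and fragment_of are made up for illustration.

```python
from urllib.parse import urldefrag

# The start URL from the question, split over two lines for readability.
START_URL = (
    "https://www.worldcat.org/search?q=computer+science&qt=results_page"
    "#%2528x0%253Aaudiobook%2Bx4%253Alp%2529format"
)


def url_sent_to_server(url: str) -> str:
    """Return the URL without its fragment (hypothetical helper).

    The fragment stays on the client, so this is the part of the URL
    that an HTTP request actually carries.
    """
    return urldefrag(url).url


def fragment_of(url: str) -> str:
    """Return the client-side fragment, still percent-encoded (hypothetical helper)."""
    return urldefrag(url).fragment


if __name__ == "__main__":
    print("requested:", url_sent_to_server(START_URL))
    print("fragment :", fragment_of(START_URL))
```

If that assumption holds, the spider is paging through the unfiltered search, which would match the 6,339,303 results seen with JS disabled. A fix would then need a request that carries the filter in the query string or in the AJAX call the page makes, which can be found in the browser's network tab.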

Scrapy does not stop scraping after pagination ends
