Question

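Setup

I am using the Scrapy example given here to scrape housing ads.

In my case, I follow the links to the housing ad pages rather than to author pages, and then scrape the housing ad pages for information.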

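Problem

My code successfully follows the links to the housing ad pages and scrapes the information for each ad. However, it does so only for the initial page, i.e. it does not follow the pagination links.

Code so far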

import scrapy


class RoomsSpider(scrapy.Spider):
    name = 'rooms'

    start_urls = ['https://www.spareroom.co.uk/flatshare/london']

    def parse(self, response):
        # follow links to ad pages
        for href in response.xpath(
            '//*[@id="maincontent"]/ul/li/article/header[1]',
            ).css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                             callback=self.parse_ad)

        # follow pagination links 
        next_page = response.xpath(
            '//*[@id="maincontent"]/div[2]/ul[2]/li/strong/a/@href',
            ).extract()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)      

    def parse_ad(self, response):
        # code extracting ad information follows here,
        # finalising the code with a yield statement.

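So I am basically following the example. When I run the code I get no errors relating to the pagination part, and the query path is correct (I believe).

Have I placed the # follow pagination links part correctly in the code? I'm lost.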

Answer 1

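It looks like a silly mistake:

    next_page = response.xpath(
        '//*[@id="maincontent"]/div[2]/ul[2]/li/strong/a/@href',
        ).extract()

returns a single-element list containing the pagination href, e.g. ['\href']. For the code to work, however, a string is required, e.g. '\href'. So, in the snippet above, replace extract() with extract()[0].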
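Two related details are worth noting. Because extract() always returns a list, the check `if next_page is not None` is always true, and extract()[0] raises an IndexError on a page that has no pagination link (presumably the last one). The sketch below illustrates a guarded version of the same idea using only the standard library; urljoin from urllib.parse stands in for response.urljoin, and the helper names and the example href are hypothetical, not taken from the site. In a spider, Scrapy's extract_first(), which returns None when nothing matches, should serve the same purpose.

```python
from urllib.parse import urljoin


def first_href(hrefs):
    """Return the first extracted href, or None when the list is empty."""
    return hrefs[0] if hrefs else None


def next_page_url(page_url, hrefs):
    """Build the absolute next-page URL from the list that extract() returns."""
    href = first_href(hrefs)
    if href is None:
        # No pagination link on this page: nothing more to follow.
        return None
    # Resolve the (possibly relative) href against the current page URL.
    return urljoin(page_url, href)


# Hypothetical values, for illustration only.
page_url = 'https://www.spareroom.co.uk/flatshare/london'
print(next_page_url(page_url, ['/flatshare/?offset=10']))
print(next_page_url(page_url, []))
```

Inside parse, the equivalent step would be to yield a new scrapy.Request only when the helper (or extract_first()) returns something other than None.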

Scrapy: following pagination link not working
