Question

import scrapy
class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['www.onthemarket.com']
    start_urls = ['https://www.onthemarket.com/for-sale/property/london/']
    def parse(self, response):
        next_page_url = response.css("li > a.arrow::attr(href)").extract_first()

        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

        print(next_page_url)

I need a list containing all of the next-page links. How do I iterate over all the pagination links and extract them with scrapy? They all have class=arrow.

Answer 1

To find and prepare links when using scrapy, I always recommend using LinkExtractor:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
...
def parse(self, response):
    ...
    # Only extract links matched by the pagination-arrow selector
    le = LinkExtractor(restrict_css=['li > a.arrow'])
    for link in le.extract_links(response):
        yield Request(link.url, callback=self.parse)

You can use it with many different filters (such as regular expressions or XPath), and you can even specify exactly where to look for links (by default it finds a tags).

Answer 2

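With .extract_first() you always get the first link in the pagination, which points to the first or second page.

With .extract()[-1] you get the last link in the pagination, which points to the next page.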

next_page_url = response.css("li > a.arrow::attr(href)").extract()[-1]

EDIT: or you can use the CSS selector :last-child (with .extract_first())

next_page_url = response.css("li > a.arrow:last-child::attr(href)").extract_first()

EDIT: or use xpath with [last()]

next_page_url = response.xpath('(//li/a[@class="arrow"]/@href)[last()]').extract_first()

or

next_page_url = response.xpath('(//li/a[@class="arrow"])[last()]/@href').extract_first()

Extract all pagination links on a page with scrapy?
