Question

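I am using Scrapy to scrape data from the Amazon website. When I use the path that SelectorGadget shows for the title's class, it does not extract the title. When I use the class {.s-access-title} instead, it works. I am not sure why SelectorGadget shows the wrong path.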

import scrapy
from ..items import AmazonsItem


class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']

    def parse(self, response):

        items = AmazonsItem()

        product_name = response.css('.s-access-title').extract()

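If you look at this image of the Amazon page, I have selected only the title, but it shows a different class, and using that class does not work. So how do I extract the title for a specific class? If you have experience with SelectorGadget, please take a look. Also, if anyone has other ideas for how to extract it, please let me know.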

Answer 1

Try the following. The title is stored in the data-attribute attribute:

import scrapy
from ..items import AmazonsItem

class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']

    def parse(self, response):
        items = AmazonsItem()
        # Read each title from the data-attribute attribute of the title element
        products_name = response.css('.s-access-title::attr("data-attribute")').extract()
        for product_name in products_name:
            print(product_name)
        # Follow the "next page" link, if there is one
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Output:

'Murder on the Orient Express (Poirot)'
'And Then There Were None'
.
.
