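Unable to scrape product titles from Amazon

Time: 2020-05-24 04:50:11

Tags: python css scrapy

I am using Scrapy to get the prices and titles of the products on this Amazon page. Extracting the price works, but the title does not. The difference I can see is that the title element carries aria-hidden="true" next to its class attribute. Here is a sample: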

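<div class="p13n-sc-truncated" aria-hidden="true" data-rows="2" title="Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine With Handle, 26 Pounds in 24 Hours, 9 Ice Cubes Ready in 7 minutes, With Ice Scoop and Basket">Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine…</div>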

This is the CSS selector command:

response.css('.p13n-sc-truncated').css('::text').extract()

What should the CSS selector be to extract the text? Thanks.

3 Answers:

Answer 0: (score: 0)

You can solve this with XPath. Go to xpather, paste your HTML there, and work out the XPath expression you need.

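import scrapy


class SSDSpider(scrapy.Spider):
    name = "SSD_spider"
    start_urls = ['https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_0']
    # A class attribute named DOWNLOAD_DELAY is not read by Scrapy;
    # per-spider settings belong in custom_settings.
    custom_settings = {'DOWNLOAD_DELAY': 10}

    def parse(self, response):
        yield {
            # /text() returns the text nodes instead of the whole <div> markup.
            'title': response.xpath('//div[@class="p13n-sc-truncated"]/text()').getall(),
        }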
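If you would rather check an XPath expression offline than in xpather, something like the following may help. It is only a sketch: it uses the standard library's xml.etree.ElementTree (which supports a small XPath subset and needs well-formed markup) as a stand-in for Scrapy's selectors, and the SAMPLE markup is made up and shortened, shaped like the div quoted in Answer 2. It also shows that the title attribute of that div holds the untruncated name, while the text node ends in "…".

```python
import xml.etree.ElementTree as ET

# Made-up, shortened sample shaped like the <div> quoted in Answer 2.
SAMPLE = (
    '<ol>'
    '<li><div class="p13n-sc-truncated" aria-hidden="true" '
    'title="Portable Ice Maker Machine With Handle">Portable Ice Maker…</div></li>'
    '<li><div class="p13n-sc-truncated" aria-hidden="true" '
    'title="Compact Refrigerator With Freezer">Compact Refrigerator…</div></li>'
    '</ol>'
)

root = ET.fromstring(SAMPLE)

# ElementTree's XPath subset supports [@attr='value'] predicates.
divs = root.findall(".//div[@class='p13n-sc-truncated']")

truncated_titles = [d.text for d in divs]      # visible, shortened text
full_titles = [d.get("title") for d in divs]   # untruncated title attribute

print(truncated_titles)
print(full_titles)
```

In a spider, the corresponding expression for the full names would presumably be response.xpath('//div[@class="p13n-sc-truncated"]/@title').getall(), provided the response really contains that markup.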


Alternatively, try Beautiful Soup:

pip install beautifulsoup4
pip install lxml
# Alternatively, on Debian/Ubuntu with Python 3:
apt-get install python3-lxml

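Beautiful Soup also depends on a parser. If you do not name one, it picks the best parser that is installed and prefers lxml, so the code below passes 'lxml' explicitly.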

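import bs4 as bs
import urllib.request

# Placeholder URL: replace it with the page you want to scrape.
source = urllib.request.urlopen('https://your_amazon_link/product/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for item in soup.select("ol#zg-ordered-list > li"):
    title = item.select_one(".p13n-sc-truncated")
    # select_one returns None when the list item has no matching element.
    if title is not None:
        print(title.get_text(strip=True))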
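Because Beautiful Soup needs a parser, it may be worth choosing one defensively when you are not sure lxml is installed. The sketch below assumes only the two parser names Beautiful Soup accepts for these backends, "lxml" and "html.parser"; pick_parser is a hypothetical helper name, not part of any library.

```python
import importlib.util


def pick_parser():
    """Return "lxml" if that package is importable, else the stdlib parser name."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser_name = pick_parser()
print(parser_name)

# Intended use (requires beautifulsoup4):
#   soup = bs.BeautifulSoup(source, pick_parser())
```

The two parsers can build slightly different trees from malformed HTML, so results may differ between machines if you rely on the fallback.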

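Answer 1: (score: 0)

If you look at the HTML source (Ctrl+U), you will see that the product title actually carries another class, p13n-sc-line-clamp-2, and selecting on that works fine. So your CSS selector could look like this: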

response.css('.p13n-sc-line-clamp-2::text').get(default='').strip()

Here is a minimal working example:

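import scrapy


# scrapy.Spider is used here instead of CrawlSpider: CrawlSpider relies on
# parse() internally, so overriding it there breaks the spider's rule handling.
class amaSpider(scrapy.Spider):
    name = 'amatitle'
    start_urls = ['https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/']

    def parse(self, response):
        # get(default='') avoids an AttributeError when nothing matches.
        yield {'title': response.css('.p13n-sc-line-clamp-2::text').get(default='').strip()}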

If you want to extract all the titles and strip their leading and trailing whitespace, change the parse function to the following:

    def parse(self, response):
        titles = response.css('.p13n-sc-line-clamp-2::text').getall()
        titles_strip = [x.strip() for x in titles]
        yield{'titles': titles_strip}
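If one item per product is more convenient than a single item holding a list, the stripping step could be pulled into a small generator. This is only a sketch: iter_title_items is a hypothetical helper name, and the raw list below is made-up data standing in for what response.css('.p13n-sc-line-clamp-2::text').getall() might return.

```python
def iter_title_items(raw_titles):
    """Yield one {'title': ...} dict per non-empty, stripped title."""
    for raw in raw_titles:
        title = raw.strip()
        if title:  # skip text nodes that contain only whitespace
            yield {'title': title}


# Made-up stand-in for the list returned by .getall() inside a spider.
raw = ['\n            Ice Maker A\n        ', '   ', 'Refrigerator B']
items = list(iter_title_items(raw))
print(items)
```

Inside the spider, parse could then end with: yield from iter_title_items(response.css('.p13n-sc-line-clamp-2::text').getall())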

Answer 2: (score: 0)

Your code is fine:

>>> from parsel import Selector
>>> selector = Selector(text='<div class="p13n-sc-truncated" aria-hidden="true" data-rows="2" title="Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine With Handle, 26 Pounds in 24 Hours, 9 Ice Cubes Ready in 7 minutes, With Ice Scoop and Basket">Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine…</div>')
>>> selector.css('.p13n-sc-truncated').css('::text').extract()
['Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine…']

My guess is that the response does not contain the HTML you expect. With Amazon that is very likely, since they have quite a few anti-bot measures.
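One way to test that guess is to check which of the class names mentioned in this thread actually occur in the body Scrapy received. The sketch below is plain Python and makes no claim about what Amazon really returns: classes_present is a hypothetical helper name and the two sample strings are made up.

```python
# Class names taken from the question and from Answer 1.
EXPECTED_CLASSES = ("p13n-sc-truncated", "p13n-sc-line-clamp-2")


def classes_present(html, names=EXPECTED_CLASSES):
    """Map each class name to whether it occurs anywhere in the raw HTML text."""
    return {name: name in html for name in names}


# Made-up bodies: one as a browser might show after scripts ran,
# one as a plain HTTP client might receive.
rendered = '<div class="p13n-sc-truncated" aria-hidden="true">Ice Maker…</div>'
raw_source = '<div class="p13n-sc-line-clamp-2">Ice Maker Machine</div>'

print(classes_present(rendered))
print(classes_present(raw_source))
```

In a spider you could log classes_present(response.text) at the top of parse. If both values are False, the page you got is probably not the one you see in the browser, and opening it with view(response) in scrapy shell should show what was actually served.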