I am using Scrapy to get the price and title of products on this Amazon website. Extracting the price works fine, but extracting the title does not. The difference I can see is that the title element carries an aria-hidden="true" attribute.
What CSS selector should I use to extract the title text? Thanks.
Answer 0 (score: 0)
You can solve this with XPath. Go to xpather, paste your HTML there, and work out the XPath pattern you need.
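import scrapy


class SSDSpider(scrapy.Spider):
    name = "SSD_spider"
    start_urls = ['https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_0']
    # Per-spider settings go in custom_settings; a bare DOWNLOAD_DELAY attribute is ignored.
    custom_settings = {'DOWNLOAD_DELAY': 10}

    def parse(self, response):
        yield {
            # /text() returns the title text of the first match instead of the whole <div> markup.
            'title': response.xpath('(//div[@class="p13n-sc-truncated"])[1]/text()').get(),
        }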
Alternatively, try Beautiful Soup:
pip install beautifulsoup4
pip install lxml
apt-get install python-lxml
Beautiful Soup also relies on a separate parser; the example below uses lxml.
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://your_amazon_link/product/').read()
soup = bs.BeautifulSoup(source, 'lxml')

for title in soup.select("ol#zg-ordered-list > li"):
    title_tag = title.select_one(".p13n-sc-truncated")
    # select_one returns None when the list item has no matching element.
    if title_tag is not None:
        print(title_tag.get_text())
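Answer 1 (score: 0)
If you look at the HTML source (Ctrl+U), you will see that the product title also has another class, p13n-sc-line-clamp-2, which works well. So your CSS selector could look like this: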
response.css('.p13n-sc-line-clamp-2::text').get().strip()
Here is a minimal working example:
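import scrapy


# A plain Spider is used here: CrawlSpider reserves parse() for its own rule handling.
class amaSpider(scrapy.Spider):
    name = 'amatitle'
    start_urls = ['https://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/']

    def parse(self, response):
        # default='' keeps .strip() from failing when nothing matches.
        yield {'title': response.css('.p13n-sc-line-clamp-2::text').get(default='').strip()}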
To extract all titles and strip their leading and trailing whitespace, change the parse method to the following:
def parse(self, response):
    titles = response.css('.p13n-sc-line-clamp-2::text').getall()
    titles_strip = [x.strip() for x in titles]
    yield {'titles': titles_strip}
Answer 2 (score: 0)
Your code is fine:
>>> from parsel import Selector
>>> selector = Selector(text='<div class="p13n-sc-truncated" aria-hidden="true" data-rows="2" title="Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine With Handle, 26 Pounds in 24 Hours, 9 Ice Cubes Ready in 7 minutes, With Ice Scoop and Basket">Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine…</div>')
>>> selector.css('.p13n-sc-truncated').css('::text').extract()
['Igloo ICEB26HNAQ Automatic Self-Cleaning Portable Electric Countertop Ice Maker Machine…']
My guess is that the response does not contain the HTML you expect. If this is Amazon, that is very likely: they have quite a few anti-bot measures.
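One way to test that guess is to check whether the HTML you actually received contains any element with the expected class. The following is a minimal sketch using only the standard library rather than Scrapy's selectors; ClassTextCollector and titles_in are hypothetical helper names introduced for illustration, and the sample markup is a shortened stand-in for the real page.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element that carries a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.class_name in classes:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def titles_in(html, class_name="p13n-sc-truncated"):
    """Return the stripped text of each element with class_name (empty list if none)."""
    collector = ClassTextCollector(class_name)
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.texts]


sample = '<div class="p13n-sc-truncated" aria-hidden="true">Igloo Ice Maker…</div>'
print(titles_in(sample))
print(titles_in("<html><body>No product markup here</body></html>"))
```

Inside a spider you could call titles_in(response.text) in parse; an empty list would suggest that the page served to the crawler differs from what the browser shows, rather than a problem with the selector.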