I am trying to scrape the details of the mobile phones listed on Amazon (from this link: here) using Scrapy.
Here is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Selector
from tars.items import ProductNameItem
import re as r

class Namespider(CrawlSpider):
    name = "flash"
    allowed_domains = ["amazon.in"]

    def __init__(self, *args, **kwargs):
        super(Namespider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@id="pagnNextLink"]')), callback="parse_start_url", follow=True),
    )

    def parse_start_url(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//li[@class="s-result-item celwidget "]')
        items = []
        for i in titles:
            item = ProductNameItem()
            # x-paths:
            name_xpath = "div[1]/div[3]/div[1]/a[1]/h2[1]/text()"
            url_xpath = "div[1]/div[3]/div[1]/a[1]/@href"
            price_xpath = "div[1]/div[5]/div[1]/a[1]/span[1]/text()"
            total_reviews_xpath = "div[1]/div[4]/a[1]/text()"
            # data-extraction:
            item["name"] = ' '.join(i.xpath(name_xpath).extract())
            item["url"] = ' '.join(i.xpath(url_xpath).extract())
            item["price"] = ' '.join(i.xpath(price_xpath).extract())
            item["total_reviews"] = ' '.join(i.xpath(total_reviews_xpath).extract())
            # append all data
            items.append(item)
        return items
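The code runs fine, but I get no data for the price and total_reviews fields. I have cross-checked several times and the XPaths look correct. Looking more closely at the 'a' and 'span' tags in those XPaths, I suspect the content of these tags is loaded using AJAX or something similar. Could someone help with how to scrape the data from these HTML tags?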
Answer 0 (score: 0)
Nothing is loaded through AJAX calls; your XPaths are wrong. The following CSS selector gets all the review counts:
In [13]: response.css("#container a[href*='#customerReviews']::text").extract()
Out[13]:
[u'9,812',
u'32',
u'32',
u'17,301',
u'1,408',
u'99',
u'9,816',
u'9,808',
u'17,298',
u'91',
u'91',
u'8,351',
u'9,585',
u'9,808',
u'10,223',
u'174',
u'809',
u'671',
u'5,215',
u'5,215',
u'1,776',
u'462',
u'671',
u'1,147']
The name, price, link, and review count are all inside the divs with the class s-item-container:
In [24]: divs = response.css("div.s-item-container")
In [25]: for d in divs:
   ....:     anchor = d.css("a.a-link-normal.s-access-detail-page.a-text-normal")[0]
   ....:     name = anchor.xpath("./h2/@data-attribute").extract_first()
   ....:     reviews = d.css("a[href*='#customerReviews']::text").extract_first()
   ....:     a = d.css("a.a-link-normal.a-text-normal")[0]
   ....:     link = a.xpath("@href").extract_first()
   ....:     price = d.css("span.a-size-base.a-color-price.s-price.a-text-bold::text").extract_first()
   ....:     print(name, price, reviews, link)
   ....:
(u'Moto G Plus, 4th Gen (Black, 32 GB)', u'14,999.00', u'9,813', u'http://www.amazon.in/Moto-Plus-4th-Gen-Black/dp/B01DDP7GZK')
(u'Moto G, 4th Gen (Black, 16GB)', u'12,499.00', u'32', u'http://www.amazon.in/Moto-4th-Gen-Black-16GB/dp/B01DDP7YI4')
(u'Moto G, 4th Gen (White, 16GB)', u'12,499.00', u'32', u'http://www.amazon.in/Moto-4th-Gen-White-16GB/dp/B01DDP7GR8')
(u'Lenovo Vibe K4 Note (Black, 16GB)', u'10,999.00', u'17,301', u'http://www.amazon.in/Lenovo-Vibe-K4-Note-Black/dp/B01A11D2U2')
(u'OnePlus 3 (Graphite, 64GB)', u'27,999.00', u'1,408', u'http://www.amazon.in/OnePlus-3-Graphite-64GB/dp/B01DDP7UQ0')
(u'Lenovo Vibe K5 (Gold, 16GB)', u'6,999.00', u'100', u'http://www.amazon.in/Lenovo-Vibe-K5-Gold-16GB/dp/B01DDP7UYC')
(u'Moto G Plus, 4th Gen (White, 32 GB)', u'14,999.00', u'9,821', u'http://www.amazon.in/Moto-Plus-4th-Gen-White/dp/B01DDP85BY')
(u'Moto G Plus, 4th Gen (Black, 16 GB)', u'13,499.00', u'9,819', u'http://www.amazon.in/Moto-Plus-4th-Gen-Black/dp/B01DDP87N0')
(u'Lenovo Vibe K4 Note (White,16GB)', u'10,999.00', u'17,302', u'http://www.amazon.in/Lenovo-Vibe-K4-Note-White/dp/B01BHUN4S6')
(u'Lenovo Vibe K5 (Silver, 16GB)', u'6,999.00', u'107', u'http://www.amazon.in/Lenovo-Vibe-K5-Silver-16GB/dp/B01DDP7D3A')
(u'Xiaomi Redmi Note 3 (Silver, 32GB)', u'11,999.00', u'1,227', u'http://www.amazon.in/Xiaomi-Redmi-Note-Silver-32GB/dp/B01DK5K8WG')
(u'Lenovo Vibe K5 (Grey, 16GB)', u'6,999.00', u'105', u'http://www.amazon.in/Lenovo-Vibe-K5-Grey-16GB/dp/B01DDP7MFE')
(u'OnePlus X (Onyx, 16GB)', u'14,999.00', u'9,585', u'http://www.amazon.in/OnePlus-E1003-X-Onyx-16GB/dp/B016UPKCGU')
(u'Moto G Plus, 4th Gen (White, 16 GB)', u'13,499.00', u'9,819', u'http://www.amazon.in/Moto-Plus-4th-Gen-White/dp/B01DDP85KU')
(u'Coolpad Note 3 (Black, 16GB)', u'8,499.00', u'10,223', u'http://www.amazon.in/Coolpad-Note-3-Black-16GB/dp/B0158IT7ES')
(u'Intex Aqua Speed HD (White-Champagne, 8GB)', u'4,190.00', u'174', u'http://www.amazon.in/Intex-Aqua-Speed-HD-White-Champagne/dp/B01FD7QTEK')
(u'Asus Zenfone Max ZC550KL-6A068IN (Black, 2GB, 16GB)', u'8,999.00', u'810', u'http://www.amazon.in/Asus-Zenfone-ZC550KL-6A068IN-Black-16GB/dp/B018VKZPG4')
(u'Coolpad Note 3 Plus (Champagne-White)', u'8,999.00', u'670', u'http://www.amazon.in/Coolpad-Note-3-Plus-Champagne-White/dp/B01DDP7V7S')
(u'Redmi 2 (White)', u'5,999.00', u'5,215', u'http://www.amazon.in/Mi-Redmi-2-White/dp/B00VEB055E')
(u'Lenovo Vibe X3 (White, 32GB)', u'19,999.00', u'1,776', u'http://www.amazon.in/Lenovo-Vibe-X3-White-32GB/dp/B01AY3H9QA')
(u'XOLO Era X (Black)', u'10,000.00', u'462', u'http://www.amazon.in/XOLO-ERA-X-Era-Black/dp/B01BWL1A0O')
(u'Coolpad Note 3 Plus (Gold)', u'8,999.00', u'671', u'http://www.amazon.in/Coolpad-Note-3-Plus-Gold/dp/B01DDP7DK8')
(u'HTC Desire 620G (Santroni White)', u'8,699.00', u'1,147', u'http://www.amazon.in/HTC-Desire-620G-Santroni-White/dp/B00R7FPSDU')
(u'Samsung Tizen Z3 (Silver)', u'5,590.00', u'9', u'http://www.amazon.in/Samsung-Tizen-Z3-Silver/dp/B01CXXJ8UY')
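Always try to locate your content using class names, attributes, and so on, and if you want to test an XPath, test it in the Scrapy shell rather than in the browser.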