Scraping data from HTML tags that appear to receive their text via AJAX, using Scrapy

Asked: 2016-07-04 18:06:58

Tags: python html ajax web-scraping scrapy

I am trying to scrape the details of the mobile phones listed on Amazon, from this link: here, using Scrapy.

Here is my code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Selector
from tars.items import ProductNameItem
import re as r

class Namespider(CrawlSpider):
    name = "flash"
    allowed_domains = ["amazon.in"]
    def __init__(self, *args, **kwargs):
        super(Namespider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@id="pagnNextLink"]')), callback="parse_start_url", follow=True),
    )

    def parse_start_url(self, response):
        hxs = Selector(response)
        titles = hxs.xpath('//li[@class="s-result-item  celwidget "]')

        items = []
        for i in titles:

            item = ProductNameItem()

            #x-paths:
            name_xpath = "div[1]/div[3]/div[1]/a[1]/h2[1]/text()"
            url_xpath = "div[1]/div[3]/div[1]/a[1]/@href"
            price_xpath = "div[1]/div[5]/div[1]/a[1]/span[1]/text()"
            total_reviews_xpath = "div[1]/div[4]/a[1]/text()"

            #data-extraction:
            item["name"] = ' '.join(i.xpath(name_xpath).extract())
            item["url"] = ' '.join(i.xpath(url_xpath).extract())
            item["price"] = ' '.join(i.xpath(price_xpath).extract())
            item["total_reviews"] = ' '.join(i.xpath(total_reviews_xpath).extract())

            #append all data
            items.append(item)


        return(items)

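The code runs without errors, but I get no data for the price and total_reviews fields. I have cross-checked several times and the XPaths look correct. Looking more closely at the 'a' and 'span' tags in those XPaths, it seems that their content is loaded using AJAX or something similar. I would appreciate any help on how to scrape data from these HTML tags.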

1 Answer:

Answer 0 (score: 0)

Nothing here is loaded by an AJAX call; your XPaths are wrong. The following CSS selector gets all the review counts:

In [13]: response.css("#container a[href*='#customerReviews']::text").extract()
Out[13]: 
[u'9,812',
 u'32',
 u'32',
 u'17,301',
 u'1,408',
 u'99',
 u'9,816',
 u'9,808',
 u'17,298',
 u'91',
 u'91',
 u'8,351',
 u'9,585',
 u'9,808',
 u'10,223',
 u'174',
 u'809',
 u'671',
 u'5,215',
 u'5,215',
 u'1,776',
 u'462',
 u'671',
 u'1,147']

The name, price, link, and review count are all inside the divs with the class s-item-container:

In [24]: divs = response.css("div.s-item-container")

In [25]: for d in divs:                             
            anchor = d.css("a.a-link-normal.s-access-detail-page.a-text-normal")[0]
            name = anchor.xpath("./h2/@data-attribute").extract_first()
            reviews = d.css("a[href*='#customerReviews']::text").extract_first()
            a = d.css("a.a-link-normal.a-text-normal")[0]
            link = a.xpath("@href").extract_first()
            price = d.css("span.a-size-base.a-color-price.s-price.a-text-bold::text").extract_first()
            print(name, price, reviews, link)
   ....:     
(u'Moto G Plus, 4th Gen (Black, 32 GB)', u'14,999.00', u'9,813', u'http://www.amazon.in/Moto-Plus-4th-Gen-Black/dp/B01DDP7GZK')
(u'Moto G, 4th Gen (Black, 16GB)', u'12,499.00', u'32', u'http://www.amazon.in/Moto-4th-Gen-Black-16GB/dp/B01DDP7YI4')
(u'Moto G, 4th Gen (White, 16GB)', u'12,499.00', u'32', u'http://www.amazon.in/Moto-4th-Gen-White-16GB/dp/B01DDP7GR8')
(u'Lenovo Vibe K4 Note (Black, 16GB)', u'10,999.00', u'17,301', u'http://www.amazon.in/Lenovo-Vibe-K4-Note-Black/dp/B01A11D2U2')
(u'OnePlus 3 (Graphite, 64GB)', u'27,999.00', u'1,408', u'http://www.amazon.in/OnePlus-3-Graphite-64GB/dp/B01DDP7UQ0')
(u'Lenovo Vibe K5 (Gold, 16GB)', u'6,999.00', u'100', u'http://www.amazon.in/Lenovo-Vibe-K5-Gold-16GB/dp/B01DDP7UYC')
(u'Moto G Plus, 4th Gen (White, 32 GB)', u'14,999.00', u'9,821', u'http://www.amazon.in/Moto-Plus-4th-Gen-White/dp/B01DDP85BY')
(u'Moto G Plus, 4th Gen (Black, 16 GB)', u'13,499.00', u'9,819', u'http://www.amazon.in/Moto-Plus-4th-Gen-Black/dp/B01DDP87N0')
(u'Lenovo Vibe K4 Note (White,16GB)', u'10,999.00', u'17,302', u'http://www.amazon.in/Lenovo-Vibe-K4-Note-White/dp/B01BHUN4S6')
(u'Lenovo Vibe K5 (Silver, 16GB)', u'6,999.00', u'107', u'http://www.amazon.in/Lenovo-Vibe-K5-Silver-16GB/dp/B01DDP7D3A')
(u'Xiaomi Redmi Note 3 (Silver, 32GB)', u'11,999.00', u'1,227', u'http://www.amazon.in/Xiaomi-Redmi-Note-Silver-32GB/dp/B01DK5K8WG')
(u'Lenovo Vibe K5 (Grey, 16GB)', u'6,999.00', u'105', u'http://www.amazon.in/Lenovo-Vibe-K5-Grey-16GB/dp/B01DDP7MFE')
(u'OnePlus X (Onyx, 16GB)', u'14,999.00', u'9,585', u'http://www.amazon.in/OnePlus-E1003-X-Onyx-16GB/dp/B016UPKCGU')
(u'Moto G Plus, 4th Gen (White, 16 GB)', u'13,499.00', u'9,819', u'http://www.amazon.in/Moto-Plus-4th-Gen-White/dp/B01DDP85KU')
(u'Coolpad Note 3 (Black, 16GB)', u'8,499.00', u'10,223', u'http://www.amazon.in/Coolpad-Note-3-Black-16GB/dp/B0158IT7ES')
(u'Intex Aqua Speed HD (White-Champagne, 8GB)', u'4,190.00', u'174', u'http://www.amazon.in/Intex-Aqua-Speed-HD-White-Champagne/dp/B01FD7QTEK')
(u'Asus Zenfone Max ZC550KL-6A068IN (Black, 2GB, 16GB)', u'8,999.00', u'810', u'http://www.amazon.in/Asus-Zenfone-ZC550KL-6A068IN-Black-16GB/dp/B018VKZPG4')
(u'Coolpad Note 3 Plus (Champagne-White)', u'8,999.00', u'670', u'http://www.amazon.in/Coolpad-Note-3-Plus-Champagne-White/dp/B01DDP7V7S')
(u'Redmi 2 (White)', u'5,999.00', u'5,215', u'http://www.amazon.in/Mi-Redmi-2-White/dp/B00VEB055E')
(u'Lenovo Vibe X3 (White, 32GB)', u'19,999.00', u'1,776', u'http://www.amazon.in/Lenovo-Vibe-X3-White-32GB/dp/B01AY3H9QA')
(u'XOLO Era X (Black)', u'10,000.00', u'462', u'http://www.amazon.in/XOLO-ERA-X-Era-Black/dp/B01BWL1A0O')
(u'Coolpad Note 3 Plus (Gold)', u'8,999.00', u'671', u'http://www.amazon.in/Coolpad-Note-3-Plus-Gold/dp/B01DDP7DK8')
(u'HTC Desire 620G (Santroni White)', u'8,699.00', u'1,147', u'http://www.amazon.in/HTC-Desire-620G-Santroni-White/dp/B00R7FPSDU')
(u'Samsung Tizen Z3 (Silver)', u'5,590.00', u'9', u'http://www.amazon.in/Samsung-Tizen-Z3-Silver/dp/B01CXXJ8UY')

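Always try to locate your content using class names, attributes, and so on, and if you want to test an XPath, test it in the scrapy shell rather than in the browser.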