How do I scrape Google Play app reviews with Scrapy?

Time: 2015-04-24 09:51:16

Tags: python xpath web-scraping google-play scrapy

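I wrote this spider to scrape the reviews of an app on Google Play. It only partly works: I can extract the reviewer name, the date, and the review text, but nothing else.

My questions:
1. How do I get all the reviews? I only get 41.
2. How do I get the rating out of the div?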

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    rating = scrapy.Field()
    data = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()



class criticspider(CrawlSpider):
    name = "gaana"
    allowed_domains = ["play.google.com"]
    start_urls = ["https://play.google.com/store/apps/details?id=com.gaana&hl=en"]
    # rules = (
        # Rule(
            # SgmlLinkExtractor(allow=('search=jabong&page=1/+',)),
            # callback="parse_start_url",
            # follow=True),
    # )

    def parse(self, response):
        sites = response.xpath('//div[@class="single-review"]')
        items = []

        for site in sites:
            item = CompItem()
            item['data'] = site.xpath('.//div[@class="review-body"]/text()').extract()
            item['name'] = site.xpath('.//div/div/span[@class="author-name"]/a/text()').extract()[0]
            item['date'] = site.xpath('.//span[@class="review-date"]/text()').extract()[0]
            item['rating'] = site.xpath('div[@class="review-info-star-rating"]/aria-label/text()').extract()

            items.append(item)
        return items

2 answers:

Answer 0 (score: 0)

You have

item['rating'] = site.xpath('.//div[@class="review-info-star-rating"]/aria-label/text()').extract()

Shouldn't aria-label be read as an attribute (@aria-label) rather than selected as a child element with text()? I don't know whether that works, but give it a try :)

Answer 1 (score: 0)

You can try this:

item['rating'] = site.xpath('.//div[@class="tiny-star star-rating-non-editable-container"]/@aria-label').extract()
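
To see why selecting the attribute helps, here is a minimal sketch that runs without Scrapy. It uses the standard library's ElementTree, which has no `/@attr` step, so the attribute is read with `.get()` instead; in a Scrapy selector the `/@aria-label` step above plays the same role. The sample markup, the label text "Rated 4 stars out of five stars", and the helper names `extract_rating_label` and `parse_stars` are assumptions made for illustration; the real page may be structured differently.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The real Google Play page is only
# assumed to look roughly like this and may differ or change.
SAMPLE = """
<div class="single-review">
  <span class="author-name"><a>Alice</a></span>
  <span class="review-date">April 20, 2015</span>
  <div class="review-info-star-rating">
    <div class="tiny-star star-rating-non-editable-container"
         aria-label="Rated 4 stars out of five stars"></div>
  </div>
  <div class="review-body">Great app</div>
</div>
"""

STAR_PATH = './/div[@class="tiny-star star-rating-non-editable-container"]'


def extract_rating_label(review):
    """Return the aria-label of the star container, or None if it is missing."""
    star = review.find(STAR_PATH)
    # aria-label is an attribute of the div, not a child element,
    # so it is read from the element itself rather than via text().
    return star.get("aria-label") if star is not None else None


def parse_stars(label):
    """Return the first integer found in the label, or None if there is none."""
    match = re.search(r"\d+", label or "")
    return int(match.group()) if match else None


review = ET.fromstring(SAMPLE)
label = extract_rating_label(review)
stars = parse_stars(label)
print(label, stars)
```

Storing the numeric value from `parse_stars` in `item['rating']` instead of the raw label would make the ratings easier to aggregate later, assuming the label always contains the star count as a digit.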