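Scrapy does not give individual results for all the reviews of a phone?

Time: 2015-06-12 06:29:39

Tags: python xpath web-scraping scrapy scrapy-spider

This code gives me results, but the output is not what I want. What is wrong with my XPath? And how do I make the rule iterate in steps of +10? I always run into these two problems.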

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name_reviewer = scrapy.Field()
    date = scrapy.Field()
    model_name = scrapy.Field()
    rating = scrapy.Field()
    review = scrapy.Field()



class criticspider(CrawlSpider):
    name = "flip_review"
    allowed_domains = ["flipkart.com"]

    start_urls = ['http://www.flipkart.com/samsung-galaxy-s5/product-reviews/ITME5Z9GKXGMFSF6?pid=MOBDUUDTADHVQZXG&type=all']
    rules = (
        Rule(
            SgmlLinkExtractor(allow=('.*\&start=.*',)),
            callback="parse_start_url",
            follow=True),
    )

    def parse_start_url(self, response):
        sites = response.css('div.review-list div[review-id]')
        items = []
        model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')
        for site in sites:
            item = CompItem()
            item['model_name'] = model_name
            item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract())
            item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()
            item['title'] = site.xpath('.//div[contains(@class,"line fk-font-normal bmargin5 dark-gray")]/strong/text()').extract()
            item['review'] = site.xpath('.//span[contains(@class,"review-text")]/text()').extract()
            yield item

My output is:

 {'date': [u'\n 31 Mar 2015 ', u'\n 23 Mar 2015 '],
  'model_name': [u'\n Reviews of A & K 333 '],
  'name_reviewer': [u'\n pradeep kumar', u'\n vikas agrawal']}

I want my output to be:

{model_name :xyz
name_reviewer :abc
date:38383
}
{model_name :xyz
name_reviewer :hfhd
date:9283
}

I think the problem is in my XPath.

2 answers:

Answer 0 (score: 1)

This should help. The problem is with your XPath:

In [1]: data_list = []

In [2]: sites = response.xpath('//div[@class="review-list"]/div')

In [3]: for site in sites:
    data = {}
    data['name_reviewer'] = site.xpath('./div/div[@class="line"]/span[@class="fk-color-title fk-font-11 review-username"]/text()|./div/div[@class="line"]/a[@class="load-user-widget fk-underline"]/text()').extract()[0].strip()
    data['date'] = site.xpath('./div/div[@class="date line fk-font-small"]/text()').extract()[0].strip()
    data['model_name'] =  response.xpath('//h1[@class="title"]/text()').extract()[0].strip()
    data_list.append(data)


In [4]: data_list
Out[4]: 
[{'date': u'10 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'RISHABH GROVER'},
 {'date': u'11 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Hemraj Chaudhari'},
 {'date': u'28 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'RISHABH GROVER'},
 {'date': u'27 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Debadutta Patnaik'},
 {'date': u'24 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Joel'},
 {'date': u'11 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Saswat Nayak'},
 {'date': u'14 Apr 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Amit Thakor'},
 {'date': u'28 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Nishchal Sharma'},
 {'date': u'13 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'siddiq hassan'},
 {'date': u'16 May 2014',
  'model_name': u'Reviews of Samsung Galaxy S5',
  'name_reviewer': u'Raja Shekhar'}]
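
The difference from the output in the question is that each field is reduced to a single stripped string with .extract()[0].strip() instead of being left as a list. A minimal sketch of that step in plain Python follows. The helper name first_clean is hypothetical and not part of Scrapy, and unlike a bare [0] it skips blank strings and returns a default instead of raising IndexError when nothing matched.

```python
def first_clean(values, default=None):
    """Return the first non-blank string in `values`, stripped, else `default`."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return default


# The list below mimics what .extract() returned in the question's output.
raw_dates = ['\n 31 Mar 2015 ', '\n 23 Mar 2015 ']
first_date = first_clean(raw_dates)
missing = first_clean([])  # no match on the page: falls back to None
```

Inside the loop this would be used as data['date'] = first_clean(site.xpath(...).extract()).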

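Answer 1 (score: 1)

First of all, your XPath expressions are generally very fragile.

The main problem with your approach is that site does not contain the review section, but it should. In other words, you are not iterating over the review blocks on the page.

Also, the model name should be extracted outside the loop, since it is the same for every review on the page. I would also use .re() to extract the model name from the title, e.g. Samsung Galaxy S5 from Reviews of Samsung Galaxy S5.

Here is the complete working code with the fixes applied: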
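
To show what that pattern captures, here is a minimal sketch that applies the same regular expression with Python's standard re module rather than Scrapy's .re() method. The function name model_name_from_title is hypothetical, and the sample title mimics the whitespace seen in the question's output.

```python
import re


def model_name_from_title(title):
    """Return the text after 'Reviews of ', or None if the title does not match."""
    match = re.search(r"Reviews of (.*?)$", title)
    return match.group(1).strip() if match else None


# Sample title with the leading newline and trailing space seen on the page.
name = model_name_from_title("\n Reviews of Samsung Galaxy S5 ")
```

Returning None for a title that does not match avoids the IndexError that indexing an empty result with [0] would raise.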

def parse_start_url(self, response):
    sites = response.css('div.review-list div[review-id]')

    model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')[0].strip()
    for site in sites:
        item = CompItem()
        item['model_name'] = model_name
        item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract()).strip()
        item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()[0].strip()
        yield item

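The XPath expressions are also simpler now. For the sake of the example, the review blocks are identified by the CSS selector div.review-list div[review-id], which matches every div element that has a review-id attribute anywhere under the div element with the review-list class.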
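
The idea of selecting the review blocks first and then querying inside each one can be sketched without Scrapy. The example below uses the standard library's xml.etree.ElementTree in place of Scrapy selectors, and the markup is a hypothetical, heavily simplified stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page is far more complex.
PAGE = """
<html><body>
  <div class="review-list">
    <div review-id="r1"><div class="date"> 10 Apr 2014 </div></div>
    <div class="ad">not a review</div>
    <div review-id="r2"><div class="date"> 11 Apr 2014 </div></div>
  </div>
</body></html>
""".strip()


def review_blocks(markup):
    """Return every div with a review-id attribute under the review-list div."""
    root = ET.fromstring(markup)
    # Rough path equivalent of the CSS selector div.review-list div[review-id].
    return root.findall(".//div[@class='review-list']//div[@review-id]")


blocks = review_blocks(PAGE)
# Each query is relative to one block, so every review yields its own date.
dates = [block.find("./div[@class='date']").text.strip() for block in blocks]
```

Because the inner lookup starts from a single block, each item gets one date instead of a list of all dates on the page.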

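Also, note how name_reviewer is extracted. Some users are registered and some are not, so the name sits either in a span with the review-username class or in a link inside a div with the line class. I took a different approach: find the review date and take the text of its preceding sibling.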
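
A minimal sketch of the "element just before the date" idea, again using xml.etree.ElementTree instead of Scrapy. ElementTree has no preceding-sibling axis, so the sibling is found by position among the block's children. The two snippets are hypothetical markup, and this simplified version only looks at direct children and matches date as a whole class token, whereas the XPath in the answer searches at any depth with contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical review blocks: the name lives in different tags,
# but always directly before the date element.
LINKED_NAME = (
    '<div review-id="r1"><a class="load-user-widget">Joel</a>'
    '<div class="date line">24 May 2014</div></div>'
)
PLAIN_NAME = (
    '<div review-id="r2"><span class="review-username"> Amit Thakor </span>'
    '<div class="date line">14 Apr 2014</div></div>'
)


def reviewer_name(block):
    """Return the text of the child element just before the date element."""
    children = list(block)
    for index, child in enumerate(children):
        is_date = "date" in child.get("class", "").split()
        if is_date and index > 0:
            # Join all text nodes of the previous sibling, like //text() does.
            return "".join(children[index - 1].itertext()).strip()
    return None


names = [reviewer_name(ET.fromstring(m)) for m in (LINKED_NAME, PLAIN_NAME)]
```

The lookup never mentions the tag or class that holds the name, which is why it works for both kinds of users.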

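I would like to point out that class names such as fk-font-small and fk-font-11 are layout-oriented and, in general, are not a good thing to base your XPath expressions and CSS selectors on. Note which classes are used in this answer to locate elements: title, date, review-list. They are more data-oriented and are a better choice for your locators.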
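
The question also asks how to step through the pages in increments of 10. A minimal sketch follows, assuming the site paginates with a start query parameter (as the &start= pattern in the question's rule suggests) and that offsets begin at 0 and grow by 10; neither assumption is confirmed here. The function name paged_urls is hypothetical, and the code is written for Python 3, unlike the Python 2 code in the question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paged_urls(base_url, pages, step=10):
    """Return `pages` URLs whose `start` offset grows by `step` (0, 10, 20, ...)."""
    parts = urlsplit(base_url)
    # Keep the existing query parameters, dropping any previous `start` value.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "start"]
    urls = []
    for page in range(pages):
        page_query = urlencode(query + [("start", page * step)])
        urls.append(urlunsplit(parts._replace(query=page_query)))
    return urls


BASE = ("http://www.flipkart.com/samsung-galaxy-s5/product-reviews/"
        "ITME5Z9GKXGMFSF6?pid=MOBDUUDTADHVQZXG&type=all")
urls = paged_urls(BASE, pages=3)
```

A list like this could be assigned to start_urls as an alternative to following pagination links with a Rule.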