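This code gives me results, but the output is not what I want. What is wrong with my XPath? And how do I iterate the rule by +10? I keep running into these two problems.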
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class CompItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()
    name_reviewer = scrapy.Field()
    date = scrapy.Field()
    model_name = scrapy.Field()
    rating = scrapy.Field()
    review = scrapy.Field()


class criticspider(CrawlSpider):
    name = "flip_review"
    allowed_domains = ["flipkart.com"]
    start_urls = ['http://www.flipkart.com/samsung-galaxy-s5/product-reviews/ITME5Z9GKXGMFSF6?pid=MOBDUUDTADHVQZXG&type=all']
    rules = (
        Rule(
            SgmlLinkExtractor(allow=('.*\&start=.*',)),
            callback="parse_start_url",
            follow=True),
    )

    def parse_start_url(self, response):
        sites = response.css('div.review-list div[review-id]')
        items = []
        model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')
        for site in sites:
            item = CompItem()
            item['model_name'] = model_name
            item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract())
            item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()
            item['title'] = site.xpath('.//div[contains(@class,"line fk-font-normal bmargin5 dark-gray")]/strong/text()').extract()
            item['review'] = site.xpath('.//span[contains(@class,"review-text")]/text()').extract()
            yield item
My output is:
{'date': [u'\n 31 Mar 2015 ', u'\n 23 Mar 2015 '],
'model_name': [u'\n Reviews of A & K 333 '],
'name_reviewer': [u'\n pradeep kumar', u'\n vikas agrawal']}
I want my output to be:
{model_name :xyz
name_reviewer :abc
date:38383
}
{model_name :xyz
name_reviewer :hfhd
date:9283
}
I think the problem is in my XPath.
Answer 0 (score: 1)
This should help; here are XPath expressions for your case:
In [1]: data_list = []
In [2]: sites = response.xpath('//div[@class="review-list"]/div')
In [3]: for site in sites:
   ...:     data = {}
   ...:     data['name_reviewer'] = site.xpath('./div/div[@class="line"]/span[@class="fk-color-title fk-font-11 review-username"]/text()|./div/div[@class="line"]/a[@class="load-user-widget fk-underline"]/text()').extract()[0].strip()
   ...:     data['date'] = site.xpath('./div/div[@class="date line fk-font-small"]/text()').extract()[0].strip()
   ...:     data['model_name'] = response.xpath('//h1[@class="title"]/text()').extract()[0].strip()
   ...:     data_list.append(data)
In [4]: data_list
Out[4]:
[{'date': u'10 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'RISHABH GROVER'},
{'date': u'11 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Hemraj Chaudhari'},
{'date': u'28 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'RISHABH GROVER'},
{'date': u'27 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Debadutta Patnaik'},
{'date': u'24 May 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Joel'},
{'date': u'11 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Saswat Nayak'},
{'date': u'14 Apr 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Amit Thakor'},
{'date': u'28 May 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Nishchal Sharma'},
{'date': u'13 May 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'siddiq hassan'},
{'date': u'16 May 2014',
'model_name': u'Reviews of Samsung Galaxy S5',
'name_reviewer': u'Raja Shekhar'}]
Answer 1 (score: 1)
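First of all, your XPath expressions are very fragile in general.
The main problem with your approach is that the site variable does not contain the review section, but it should. In other words, you are not iterating over the review blocks on the page.
Also, the model name should be extracted outside of the loop, since it is the same for every review on the page. I would also use .re() to extract the model name from the title, for instance SAMSUNG GALAXY S5 out of REVIEWS OF SAMSUNG GALAXY S5.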
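To illustrate the .re() step in isolation, here is a minimal sketch that applies the same pattern with the standard library re module as a stand-in for the Scrapy selector method. The helper name extract_model_name is hypothetical, and the sample string is taken from the outputs shown in this thread.

```python
import re


def extract_model_name(title_text):
    """Return the text after 'Reviews of ', or None if the title does not match.

    Hypothetical helper: it mirrors the pattern passed to .re() in the answer,
    but uses the stdlib re module so it can run without Scrapy.
    """
    # Strip first so surrounding whitespace/newlines do not end up in the capture.
    match = re.search(r'Reviews of (.*?)$', title_text.strip())
    return match.group(1).strip() if match else None


model_name = extract_model_name(u'\n Reviews of Samsung Galaxy S5 ')
```

Doing this once before the loop gives a single string that can be assigned to every item, instead of a list per item.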
Here is the complete working code with the fixes applied:
def parse_start_url(self, response):
    sites = response.css('div.review-list div[review-id]')
    model_name = response.xpath('//h1[@class="title"]/text()').re(r'Reviews of (.*?)$')[0].strip()
    for site in sites:
        item = CompItem()
        item['model_name'] = model_name
        item['name_reviewer'] = ''.join(site.xpath('.//div[contains(@class, "date")]/preceding-sibling::*[1]//text()').extract()).strip()
        item['date'] = site.xpath('.//div[contains(@class, "date")]/text()').extract()[0].strip()
        yield item
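The XPath expressions also became simpler. For the sake of the example, the review sections are identified by the CSS selector div.review-list div[review-id], which matches all div elements that have a review-id attribute anywhere under a div element with the review-list class.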
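Also, note how name_reviewer is extracted. There are different kinds of users: some are registered and appear as a profile link, others are not registered and sit inside a span with the review-username class. So I took a different approach: locate the review date and get the text of its first preceding sibling.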
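The "find the date, then take the preceding sibling" idea can be sketched without Scrapy. The markup below is hypothetical and heavily simplified (the real page is more complex), and the helper reviewer_and_date is an illustrative name. It uses the standard library ElementTree, which has no preceding-sibling axis, so the sibling lookup is done by index instead of by XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one reviewer is a link, the other a span.
SAMPLE = """
<div class="review-list">
  <div review-id="1">
    <div>
      <a class="load-user-widget fk-underline">RISHABH GROVER</a>
      <div class="date line fk-font-small">10 Apr 2014</div>
    </div>
  </div>
  <div review-id="2">
    <div>
      <span class="fk-color-title fk-font-11 review-username">Joel</span>
      <div class="date line fk-font-small">24 May 2014</div>
    </div>
  </div>
</div>
"""


def reviewer_and_date(review):
    """Return (name, date) for one review block, or None if no date is found."""
    for parent in review.iter():
        children = list(parent)
        for index, child in enumerate(children):
            # Token match on the class attribute; the XPath in the answer uses
            # contains(), which is a looser substring match.
            if "date" in child.get("class", "").split() and index > 0:
                # The element right before the date holds the reviewer name,
                # whichever tag (a or span) it happens to be.
                name = "".join(children[index - 1].itertext()).strip()
                return name, (child.text or "").strip()
    return None


root = ET.fromstring(SAMPLE.strip())
# Equivalent in spirit to the CSS selector div.review-list div[review-id].
reviews = [div for div in root.iter("div") if div.get("review-id")]
results = [reviewer_and_date(review) for review in reviews]
```

Because the lookup is anchored on the date element, it does not care whether the name is wrapped in a link or a span, which is the point of the approach.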
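I would also point out that class names like line, fk-font-small and fk-font-11 are layout-oriented classes and, in general, not a good choice to rely on in your XPath expressions and CSS selectors. Note which classes are used in this answer to locate elements: review-list, title, date. They are more data-oriented and a better choice for your locators.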
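The question also asks how to step through pages by +10, which neither answer covers. Below is a minimal sketch, assuming the start query parameter (the one matched by the rule's allow pattern) is a zero-based offset that grows by 10 per page; that behaviour is not confirmed anywhere in this thread. The function paged_urls and its parameters are hypothetical names, and the code targets Python 3's urllib.parse rather than the Python 2 urlparse module imported in the question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paged_urls(base_url, pages, step=10, param="start"):
    """Build `pages` URLs whose `param` offset increases by `step` each time.

    Assumption: the site paginates with an offset parameter named `start`.
    """
    parts = urlsplit(base_url)
    # Keep the existing query parameters, dropping any previous offset.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != param]
    urls = []
    for page in range(pages):
        page_query = query + [(param, str(page * step))]
        urls.append(urlunsplit(parts._replace(query=urlencode(page_query))))
    return urls


urls = paged_urls(
    "http://www.flipkart.com/samsung-galaxy-s5/product-reviews/"
    "ITME5Z9GKXGMFSF6?pid=MOBDUUDTADHVQZXG&type=all",
    pages=3,
)
```

If the site really paginates this way, a list like this could be assigned to start_urls instead of relying on link extraction.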