I want to scrape data from this website: http://www.consumercomplaints.in/?search=ecom-express#. I hope my request is fairly simple and straightforward for experienced Scrapy users.

Problem: I am trying to scrape the data for each complaint. By data I mean **the main title, subtitle, username, date, and the complaint text.** But I am unable to get the full complaint: to get it I have to follow the link embedded in the main title and take the whole complaint from there, rather than the shortened version shown on the listing page, and I need to do this for every complaint.
My spider class:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from consumercomplaint.items import ConsumercomplaintItem


class MySpider(BaseSpider):
    name = "consumer"
    allowed_domains = ["http://www.consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=ecom-express&page=11"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//table[@width="100%"]')
        print titles
        items = []
        del(titles[0])
        for i in titles:
            item = ConsumercomplaintItem()
            item["maintitle"] = i.select('.//a[1]//text()').extract()
            item["username"] = i.select('.//td[@class="small"]//a[2]/text()').extract()
            item["date"] = i.select('.//td[@class="small"]/text()').extract()
            item["subtitle"] = i.select('.//td[@class="compl-text"]/div/b[1]/text()').extract()
            item["complaint"] = i.select('.//td[@class="compl-text"]/div/text()').extract()
            items.append(item)
        return items
My item class:
from scrapy.item import Item, Field


class ConsumercomplaintItem(Item):
    maintitle = Field()
    username = Field()
    date = Field()
    subtitle = Field()
    complaint = Field()
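As a usage note: with the item and spider defined, a spider like this is normally run from the Scrapy project root with the standard command-line tool (the output filename below is arbitrary):

scrapy crawl consumer -o complaints.json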
Answer 0 (score: 1):
I would do it in two stages: first parse the listing page, stash the partially filled item in the request's meta, and request each complaint's detail page; then, in the callback:

a) extract the full complaint
b) pull the item out of meta
c) save the complaint into the item's field
d) yield the item
import scrapy  # add at the top of the spider file; needed for scrapy.Request

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select('//table[@width="100%"]')
    del(titles[0])  # drop the header table
    for i in titles:
        item = ConsumercomplaintItem()
        item["maintitle"] = i.select('.//a[1]//text()').extract()
        item["username"] = i.select('.//td[@class="small"]//a[2]/text()').extract()
        item["date"] = i.select('.//td[@class="small"]/text()').extract()
        item["subtitle"] = i.select('.//td[@class="compl-text"]/div/b[1]/text()').extract()
        # placeholder XPath: point this at the link wrapped around the main title
        complaint_link = i.xpath('//complaint/link/a/@href').extract_first()
        complaint_page = response.urljoin(complaint_link)
        request = scrapy.Request(complaint_page, callback=self.parse_complaint)
        request.meta['item'] = item  # carry the partial item over to the callback
        yield request

def parse_complaint(self, response):
    item = response.meta['item']
    # placeholder XPath: point this at the full complaint text on the detail page
    item['complaint'] = response.xpath('/complaint/path/text()').extract_first()
    yield item
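On a current Scrapy release the same two-stage pattern looks roughly like the minimal sketch below, assuming Scrapy ≥ 1.7 (for cb_kwargs); the XPaths for the fields, the title link, and the detail-page text are placeholders borrowed from the question and would need to be checked against the site's actual markup:

import scrapy


class ConsumerSpider(scrapy.Spider):
    name = "consumer"
    # allowed_domains takes bare domains, not full URLs
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=ecom-express&page=11"]

    def parse(self, response):
        # Stage 1: collect the partial fields from each listing table,
        # then follow the link wrapped around the main title.
        for table in response.xpath('//table[@width="100%"]')[1:]:  # [1:] skips the header table
            item = {
                "maintitle": table.xpath('.//a[1]//text()').get(),
                "username": table.xpath('.//td[@class="small"]//a[2]/text()').get(),
                "date": table.xpath('.//td[@class="small"]/text()').get(),
                "subtitle": table.xpath('.//td[@class="compl-text"]/div/b[1]/text()').get(),
            }
            # placeholder XPath: the href of the main-title link
            complaint_link = table.xpath('.//a[1]/@href').get()
            if complaint_link:
                # response.follow resolves relative URLs; cb_kwargs hands the
                # partially filled item to the callback
                yield response.follow(complaint_link, callback=self.parse_complaint,
                                      cb_kwargs={"item": item})

    def parse_complaint(self, response, item):
        # Stage 2: fill in the full complaint text and yield the finished item.
        # placeholder XPath: adapt to the detail page's actual markup
        item["complaint"] = " ".join(response.xpath('//td[@class="compl-text"]/div//text()').getall())
        yield item

Passing the partial item through cb_kwargs (or request.meta on older versions) is what ties the two callbacks together: each detail page completes exactly one item.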