Scrapy does not extract the title correctly

Date: 2015-01-27 17:48:49

Tags: scrapy web-crawler text-extraction

In this code I want to scrape the title, subtitle, and data from each link, but on pages beyond 1 and 2 something goes wrong: only 1 item is scraped. I want to extract only the entries whose title contains Delhivery.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from delhivery.items import DelhiveryItem

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery&page=2"]


    def parse(self, response):
        sites = response.xpath('//table[@width="100%"]')

        for site in sites:
            item = DelhiveryItem()
            item['title'] = site.xpath('.//td[@class="complaint"]/a/span[@style="background-color:yellow"]/text()').extract()[0]
            #item['title'] = site.xpath('.//td[@class="complaint"]/a[text() = "%s Delivery Courier %s"]/text()').extract()[0]
            item['subtitle'] = site.xpath('.//td[@class="compl-text"]/div/b[1]/text()').extract()[0]
            item['date'] = site.xpath('.//td[@class="small"]/text()').extract()[0].strip()
            item['username'] = site.xpath('.//td[@class="small"]/a[2]/text()').extract()[0]
            item['link'] = site.xpath('.//td[@class="complaint"]/a/@href').extract()[0]
            if item['link']:
                if 'http://' not in item['link']:
                    item['link'] = urljoin(response.url, item['link'])
                # follow the complaint link; anchor_page fills in item['data']
                yield scrapy.Request(item['link'],
                                     meta={'item': item},
                                     callback=self.anchor_page)

    def anchor_page(self, response):
        old_item = response.request.meta['item']
        old_item['data'] = response.xpath('.//td[@style="padding-bottom:15px"]/div/text()').extract()[0]
        yield old_item
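
For reference, the DelhiveryItem imported above lives in delhivery/items.py. A minimal sketch of what it presumably looks like (the field names are inferred from the assignments in the spider):

import scrapy

class DelhiveryItem(scrapy.Item):
    # one field per value the spider fills in
    title = scrapy.Field()
    subtitle = scrapy.Field()
    date = scrapy.Field()
    username = scrapy.Field()
    link = scrapy.Field()
    data = scrapy.Field()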

1 Answer:

Answer 0 (score: 1)

You need to change item['title'] to:

item['title'] = ''.join(site.xpath('//table[@width="100%"]//span[text() = "Delhivery"]/parent::*//text()').extract())

Also change sites so that only the desired entries (the ones containing Delhivery) are extracted:

sites = response.xpath('//table//span[text()="Delhivery"]/ancestor::div')
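
The idea behind both selectors: the search page wraps the matched word in a highlight span, which splits the title into several text nodes, so joining parent::*//text() reassembles the full title, and anchoring sites on that span keeps only the matching rows. A quick way to sanity-check this (using a hypothetical HTML fragment that mimics one search-result row):

from scrapy.selector import Selector

# Hypothetical fragment: the search term is wrapped in a highlight <span>,
# so the link text is split into two text nodes.
html = ('<div><table width="100%"><tr><td class="complaint">'
        '<a href="/complaints/x-c123.html">'
        '<span style="background-color:yellow">Delhivery</span>'
        ' courier not delivered</a></td></tr></table></div>')

sel = Selector(text=html)
sites = sel.xpath('//table//span[text()="Delhivery"]/ancestor::div')
title = ''.join(sites[0].xpath('.//span[text()="Delhivery"]/parent::*//text()').extract())
print(title)  # -> 'Delhivery courier not delivered'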

Edit: So I now understand that you need to add a pagination rule to your code. It should look something like this: you only need to add the imports and write the new XPaths from an item page itself, e.g. this one.

class criticspider(CrawlSpider):
    name = "delh"
    allowed_domains = ["consumercomplaints.in"]
    start_urls = ["http://www.consumercomplaints.in/?search=delhivery"]

    rules = (
        # Extract pagination links, allowing only links with page=number to be followed
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagelinks"]',), allow=(r'page=\d+',), unique=True), follow=True),

        # Extract the links of the items on each page the spider gets from the first rule
        Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="complaint"]',)), callback='parse_item'),
    )

    def parse_item(self, response):
        item = DelhiveryItem()
        # Populate the item here the same way you did before; this callback runs
        # once per item link. That means you'll be extracting data from pages
        # like this one:
        # http://www.consumercomplaints.in/complaints/delhivery-last-mile-courier-service-poor-delivery-service-c772900.html#c1880509
        item['title'] = response.xpath('<write xpath>').extract()[0]
        item['subtitle'] = response.xpath('<write xpath>').extract()[0]
        item['date'] = response.xpath('<write xpath>').extract()[0].strip()
        item['username'] = response.xpath('<write xpath>').extract()[0]
        item['link'] = response.url
        item['data'] = response.xpath('<write xpath>').extract()[0]
        yield item
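
Note that the callback is named parse_item rather than parse: a CrawlSpider uses the built-in parse method internally to drive its rules, so overriding parse (as the original spider does) cannot be combined with rule-based crawling. Once the placeholder XPaths are filled in, the spider runs as usual, e.g. scrapy crawl delh -o items.json.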

Also, I suggest that when you write an XPath you avoid style attributes: try to use @class or @id, and only fall back on @width, @style, or any other styling attribute when it is the only way.
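
For example, comparing two selectors already used above (whether a semantic class exists for a given element depends on the page):

# Fragile: breaks as soon as the inline styling changes
response.xpath('//td[@style="padding-bottom:15px"]/div/text()')

# Sturdier: keyed off a semantic attribute instead
response.xpath('//td[@class="compl-text"]/div/text()')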