Processing a JSON response with Scrapy

Time: 2014-12-12 14:41:19

Tags: json web-scraping scrapy

I have the following code in my Scrapy spider:

    def parse(self, response):
        jsonresponse = json.loads(response.body_as_unicode())
        htmldata = jsonresponse["html"]
        for sel in htmldata.xpath('//li/li'):
            # -- more xpath code --
        yield item

But I get this error:

    raise ValueError("No JSON object could be decoded")
    exceptions.ValueError: No JSON object could be decoded

After checking the JSON response, I found that **<!--WPJM-->** and **<!--WPJM_END-->** are what cause this error:

<!--WPJM-->{"found_jobs":true,"html":"<html code>","max_num_pages":3}<!--WPJM_END-->

How can I parse this in my Scrapy spider while ignoring the <!--WPJM--> and <!--WPJM_END--> markers?

Edit: this is my error:

    File "/home/muhammad/Projects/project/project/spiders/crawler.py", line 150, in parse
        for sel in htmldata.xpath('//li'):
    exceptions.AttributeError: 'unicode' object has no attribute 'xpath'

Here is my current code:

    def parse(self, response):
        rawdata = response.body_as_unicode()
        jsondata = rawdata.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
        # print jsondata  # For debugging
        # pass
        data = json.loads(jsondata)
        htmldata = data["html"]
        # print htmldata  # For debugging
        # pass
        for sel in htmldata.xpath('//li'):
            item = ProjectjomkerjaItem()
            item['title'] = sel.xpath('a/div[@class="position"]/div[@id="job-title-job-listing"]/strong/text()').extract()
            item['company'] = sel.xpath('a/div[@class="position"]/div[@class="company"]/strong/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()

1 answer:

Answer 0 (score: 0)

The simplest approach is to strip the comment markers manually with replace():

    import json

    data = response.body_as_unicode()
    # Strip the HTML comment markers that wrap the JSON payload.
    data = data.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
    jsonresponse = json.loads(data)

That said, this is neither very Pythonic nor very robust.
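
One way to make the marker stripping a little less brittle is to pull out whatever sits between the two comments with a regular expression rather than chaining replace() calls. The following is only a sketch, not the answerer's code: the helper name `extract_wrapped_json` and the sample payload are hypothetical, and it assumes the markers always appear exactly as `<!--WPJM-->` and `<!--WPJM_END-->`.

```python
import json
import re

# Matches the payload between the two WPJM comment markers.
# re.DOTALL lets the payload span several lines.
WPJM_RE = re.compile(r"<!--WPJM-->(.*?)<!--WPJM_END-->", re.DOTALL)


def extract_wrapped_json(raw):
    """Decode the JSON found between the WPJM markers.

    Hypothetical helper: if the markers are absent, the whole
    string is decoded instead.
    """
    match = WPJM_RE.search(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload)


# Made-up sample modelled on the response shown in the question.
sample = (
    '<!--WPJM-->'
    '{"found_jobs":true,"html":"<ul><li>job</li></ul>","max_num_pages":3}'
    '<!--WPJM_END-->'
)
data = extract_wrapped_json(sample)
print(data["max_num_pages"])
```

In a spider this would presumably be called as `extract_wrapped_json(response.body_as_unicode())`, or with `response.text` on newer Scrapy versions.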

Alternatively, a better option is to get the text() via XPath:

    $ scrapy shell index.html
    >>> response.xpath('//text()').extract()[0]
    u'{"found_jobs":true,"html":"<html code"}'