I have the following code in my Scrapy spider:
def parse(self, response):
    jsonresponse = json.loads(response.body_as_unicode())
    htmldata = jsonresponse["html"]
    for sel in htmldata.xpath('//li/li'):
        -- more xpath codes --
        yield item
But I get this error:
raise ValueError("No JSON object could be decoded")
exceptions.ValueError: No JSON object could be decoded
After inspecting the JSON response, I found that the error is caused by the **<!--WPJM-->** and **<!--WPJM_END-->** markers wrapped around it.
<!--WPJM-->{"found_jobs":true,"html":"<html code>","max_num_pages":3}<!--WPJM_END-->
How can I parse the response in my Scrapy spider while ignoring the <!--WPJM--> and <!--WPJM_END--> markers?
EDIT: This is the error I get now:
  File "/home/muhammad/Projects/project/project/spiders/crawler.py", line 150, in parse
    for sel in htmldata.xpath('//li'):
exceptions.AttributeError: 'unicode' object has no attribute 'xpath'
def parse(self, response):
    rawdata = response.body_as_unicode()
    jsondata = rawdata.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
    # print jsondata # For debugging
    # pass
    data = json.loads(jsondata)
    htmldata = data["html"]
    # print htmldata # For debugging
    # pass
    for sel in htmldata.xpath('//li'):
        item = ProjectjomkerjaItem()
        item['title'] = sel.xpath('a/div[@class="position"]/div[@id="job-title-job-listing"]/strong/text()').extract()
        item['company'] = sel.xpath('a/div[@class="position"]/div[@class="company"]/strong/text()').extract()
        item['link'] = sel.xpath('a/@href').extract()
Answer 0 (score: 0)
The simplest approach is to strip the comment markers manually with replace():
data = response.body_as_unicode()
data = data.replace('<!--WPJM-->', '').replace('<!--WPJM_END-->', '')
jsonresponse = json.loads(data)
This is not particularly Pythonic or robust, though.
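Alternatively, a better option is to fetch the text() node via XPath. Because the markers are HTML comments, they are not part of the text node: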
$ scrapy shell index.html
>>> response.xpath('//text()').extract()[0]
u'{"found_jobs":true,"html":"<html code"}'
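Either approach leaves the value of "html" as a plain string, which is why calling .xpath() on it raises the AttributeError shown in the question's edit. As a minimal sketch of the marker-stripping step, assuming the response always wraps exactly one JSON object in the two markers, a regular expression can pull out the payload before decoding. The helper name `extract_wpjm_json` and the sample HTML are hypothetical and only for illustration:

```python
import json
import re

# Matches the payload between the two comment markers seen in the question.
WPJM_RE = re.compile(r"<!--WPJM-->(.*?)<!--WPJM_END-->", re.DOTALL)


def extract_wpjm_json(body):
    """Decode the JSON object wrapped by the WPJM comment markers.

    Hypothetical helper: falls back to decoding the whole body
    when the markers are absent.
    """
    match = WPJM_RE.search(body)
    payload = match.group(1) if match else body
    return json.loads(payload)


# Sample body modelled on the response in the question (the HTML is made up).
sample = (
    '<!--WPJM-->{"found_jobs":true,'
    '"html":"<li><a href=\\"/job/1\\">Job</a></li>",'
    '"max_num_pages":3}<!--WPJM_END-->'
)
data = extract_wpjm_json(sample)
print(data["max_num_pages"])
```

Inside the spider, the decoded `data["html"]` would still need to be wrapped before querying it, for example with `Selector(text=data["html"])` from `scrapy.selector`, so that `.xpath('//li')` is called on a selector rather than on a string.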