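I'm still trying to figure out how to use json.loads and json.dump to extract what I want from a web page. I'm working with some data from this link, which is formatted as: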
data:{
    url: 'stage-player-stat'
},
defaultParams: {
    stageId: 9155,
    teamId: 32,
    playerId: -1,
    field: 2
},
The code I'm using is:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests

class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Teams/32/"]
    rules = [Rule(SgmlLinkExtractor(allow=('\Teams'),deny=(),), follow=False, callback='parse_item')]

    def parse_item(self, response):
        stagematch = re.compile("data:\s*{\s*url:\s*'stage-player-stat'\s*},\s*defaultParams:\s*{\s*(.*?),.*},",re.S)
        stagematch2 = re.search(stagematch, response.body)
        if stagematch2 is not None:
            stagematch3 = stagematch2.group(1)
            stageid = json.dumps(stagematch3)
            print "stageid = ", stageid

execute(['scrapy','crawl','goal2'])
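In this example, stageid resolves to "stageId: 9155". What I want it to resolve to is 9155. I tried extracting the value with stageid = stageid[0], as if stageid were a dictionary, but that didn't work. What am I doing wrong?
Thanks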
Answer 0: (score: 2)
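A solution using js2xml:
- select the <script> content that contains var defaultTeamPlayerStatsConfigParams and get its initial value, an object
- use js2xml.jsonlike.make_dict() to get a Python dict from it
Here is how it works, illustrated in this scrapy shell session: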
$ scrapy shell http://www.whoscored.com/Teams/32/
2014-09-08 11:17:31+0200 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
...
2014-09-08 11:17:32+0200 [default] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/32/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f88f0605990>
[s] item {}
[s] request <GET http://www.whoscored.com/Teams/32/>
[s] response <200 http://www.whoscored.com/Teams/32/>
[s] settings <scrapy.settings.Settings object at 0x7f88f6046450>
[s] spider <Spider 'default' at 0x7f88efdaff50>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]: import pprint
In [2]: import js2xml
In [3]: for script in response.xpath('//script/text()').extract():
   ...:     jstree = js2xml.parse(script)
   ...:     params = jstree.xpath('//var[@name="defaultTeamPlayerStatsConfigParams"]/object')
   ...:     if params:
   ...:         pprint.pprint(js2xml.jsonlike.make_dict(params[0]))
   ...:
{'data': {'url': 'stage-player-stat'},
'defaultParams': {'field': 2, 'playerId': -1, 'stageId': 9155, 'teamId': 32},
'fitText': {'container': '.grid .team-link, .grid .player-link',
'options': {'width': 150}},
'fixZeros': True}
In [4]: for script in response.xpath('//script/text()').extract():
   ...:     jstree = js2xml.parse(script)
   ...:     params = jstree.xpath('//var[@name="defaultTeamPlayerStatsConfigParams"]/object')
   ...:     if params:
   ...:         params = js2xml.jsonlike.make_dict(params[0])
   ...:         print params["defaultParams"]["stageId"]
   ...:
9155
In [5]:
Answer 1: (score: 1)
stagematch3 = stagematch2.group(1)
stageid = int(stagematch3.split(':', 1)[1])
If you want, you can convert it back to a str:
stageid = str(stageid)
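To see the split-and-convert step end to end, here is a minimal sketch that runs outside Scrapy. It assumes the page text looks like the snippet in the question; SAMPLE, STAGE_RE and extract_stage_id are names made up for this illustration, and the pattern is the one from the question written as a raw string. In the spider, response.body would take the place of SAMPLE.

```python
import re

# Sample text copied from the question; stands in for response.body.
SAMPLE = """
data:{
    url: 'stage-player-stat'
},
defaultParams: {
    stageId: 9155,
    teamId: 32,
    playerId: -1,
    field: 2
},
"""

# Same pattern as in the question, written as a raw string.
STAGE_RE = re.compile(
    r"data:\s*{\s*url:\s*'stage-player-stat'\s*},\s*defaultParams:\s*{\s*(.*?),.*},",
    re.S,
)


def extract_stage_id(text):
    """Return stageId as an int, or None if the pattern is not found."""
    match = STAGE_RE.search(text)
    if match is None:
        return None
    # group(1) is the text "stageId: 9155"; keep what follows the first colon.
    # int() tolerates the leading space left over after the split.
    return int(match.group(1).split(":", 1)[1])


print(extract_stage_id(SAMPLE))
```

json.dumps is not needed here: it serializes a Python object to a JSON string, so calling it on the matched text only wraps that text in quotes.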
There are many other ways to solve your problem. One of them is to use a simpler regular expression and then parse the matched text with json.loads.
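A rough sketch of that regex-plus-json.loads route follows. One caveat: the snippet in the question is a JavaScript object literal with unquoted keys, which is not strict JSON, so this sketch quotes the keys before calling json.loads. That quoting step is my own assumption and only holds for a flat object with numeric values like the one shown. PARAMS_RE and parse_default_params are hypothetical names.

```python
import json
import re

# Sample text copied from the question; stands in for response.body.
SAMPLE = """
data:{
    url: 'stage-player-stat'
},
defaultParams: {
    stageId: 9155,
    teamId: 32,
    playerId: -1,
    field: 2
},
"""

# Simpler pattern: capture only the braces that follow defaultParams.
PARAMS_RE = re.compile(r"defaultParams:\s*(\{.*?\})", re.S)


def parse_default_params(text):
    """Return defaultParams as a dict, or None if the block is not found."""
    match = PARAMS_RE.search(text)
    if match is None:
        return None
    js_object = match.group(1)
    # Quote the bare keys so the text becomes valid JSON:
    # stageId: 9155  ->  "stageId": 9155
    # Assumes a flat object whose values contain no colons or quotes.
    as_json = re.sub(r"(\w+)\s*:", r'"\1":', js_object)
    return json.loads(as_json)


params = parse_default_params(SAMPLE)
print(params["stageId"])
```

For nested objects or string values, a real JavaScript parser such as js2xml from the other answer is the safer choice.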