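I am trying to parse an item called matchCentreData, which can be found in the source code of the following page: http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United
Because no XHR requests are involved on this page and the data item is embedded in the page source itself, I am not sure how to parse it with anything other than a regular expression.
Because the data structure is deeply nested, I am trying to break it down into several subcomponents and parse each one separately. Here is my code, which attempts to parse the first subcomponent, playerIdNameDictionary: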
import json
import simplejson
import requests
import jsonobject
import time
import re
url = 'http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United'
params = {}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
'Host': 'www.whoscored.com',
'Referer': 'http://www.whoscored.com/'}
responser = requests.get(url, params=params, headers=headers)
regex = re.compile("matchCentreData = \{.*?\};", re.S)
match = re.search(regex, responser.text)
match2 = match.group()
match3 = match2[u'playerIdNameDictionary']
print match3
However, this produces the following error:
Traceback (most recent call last):
File "C:\Python27\counter.py", line 23, in <module>
match3 = match2[u'playerIdNameDictionary']
TypeError: string indices must be integers
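I think this is because the item I get back is a string rather than a JSON object. What I would like to know is:
1) Is my diagnosis of the problem in the sentence above correct?
2) How can I parse the JSON/JavaScript object matchCentreData without using regular expressions?
I hope my question makes sense.
Thanks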
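Answer 0 (score: 0)
match2 is just a string, not a JSON object. You can convert the string into a Python dict with match2 = json.loads(match2). Wrap the json.loads call in a try/except block to catch errors in the source JSON (json.loads raises ValueError on malformed input).
More information on json.loads(): https://docs.python.org/2/library/json.html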
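The difference between the matched string and the decoded object can be shown on a small stand-in input. This is a minimal Python 3 sketch: the sample text and the helper name parse_blob are made up for illustration and are not taken from the real page.

```python
import json
import re

# Hypothetical stand-in for the page source (the real page is much larger).
page_text = 'var matchCentreData = {"playerIdNameDictionary":{"34693":"Marko Arnautovic"}};'

match = re.search(r"matchCentreData = (\{.*\});", page_text, re.S)
raw = match.group(1)  # still a str, so raw['playerIdNameDictionary'] raises TypeError


def parse_blob(text):
    """Return the decoded object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        print("Could not decode JSON:", exc)
        return None


data = parse_blob(raw)  # a dict, which can be indexed by key
name = data["playerIdNameDictionary"]["34693"]
```

Catching ValueError keeps the sketch usable on both Python 2 and Python 3.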
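As I said in the comments below, your regular expression is a bit too loose. It starts matching when it finds var matchCentreData = { ..., but it keeps matching until the end of the last JSON blob in response.text. That is not something json.loads can handle. I have changed the code to: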
>>> regex = re.compile("var matchCentreData = (\{.+\});\r\n var matchCentreEventTypeJson", re.S)
>>> match = re.search(regex, response.text)
>>> # now match.groups(1)[0] will contain the match centre data json blob
>>> match_centre_data = json.loads(match.groups(1)[0])
>>> match_centre_data['playerIdNameDictionary']['34693']
'Marko Arnautovic'
Note that this kind of code is very fragile and may break whenever whoscored.com updates its site.
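One way to depend less on what follows the object in the page is to let the JSON decoder find the end of the object itself. The sketch below is an illustration in Python 3, not the code used above: it locates the assignment with str.find and then calls json.JSONDecoder().raw_decode, which parses a single JSON value starting at a given index and ignores the trailing text. The helper name extract_js_object and the sample string are hypothetical, and the approach assumes the embedded object is strict JSON.

```python
import json


def extract_js_object(page_text, var_name):
    """Decode the JSON object assigned to `var <var_name> = ...` in page_text."""
    marker = "var %s = " % var_name
    start = page_text.find(marker)
    if start == -1:
        raise ValueError("%s not found in page text" % var_name)
    # raw_decode returns (object, index just past the object); text after
    # the object, such as ";" and later variables, is not parsed.
    obj, _end = json.JSONDecoder().raw_decode(page_text, start + len(marker))
    return obj


# Hypothetical stand-in for response.text.
sample = (
    'var matchCentreData = {"playerIdNameDictionary":{"34693":"Marko Arnautovic"}};\r\n'
    ' var matchCentreEventTypeJson = {"shotSixYardBox":0,"shotPenaltyArea":1};'
)

match_centre_data = extract_js_object(sample, "matchCentreData")
event_types = extract_js_object(sample, "matchCentreEventTypeJson")
```

This still breaks if the site renames the variable, but it no longer relies on the order of the variables or on the exact whitespace between them.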
Answer 1 (score: 0)
You can use BeautifulSoup to extract the script:
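import json
import re
from bs4 import BeautifulSoup
# r is the response returned by requests.get(url, headers=headers)
soup = BeautifulSoup(r.content)
# The lazy .*? stops at the first "}", so group(1) is only the start of the object
data_cen = re.compile('var matchCentreData = ({.*?})')
data = soup.find("script",text=data_cen).text
# dumps followed by loads round-trips the matched text, so data_dict is a str
d = json.dumps(data_cen.search(data).group(1))
data_dict = (json.loads(d))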
{"playerIdNameDictionary":{"34693":"Marko Arnautovic","23122":"Asmir Begovic","39935":"Steven N'Zonzi","4145":"Robert Huth","3860":"Jonathan Walters","23446":"Marc Wilson","8505":"Glenn Whelan","29762":"Oussama Assaidi","24148":"Erik Pieters","26013":"Mame Biram Diouf","75177":"Marc Muniesa","38772":"Geoff Cameron","107395":"Jack Butland","29798":"Ryan Shawcross","3807":"Peter Crouch","8327":"Charlie Adam","18181":"Phil Bardsley","254558":"Oliver Shenton","130334":"Adnan Januzaj","4092":"Rafael","18701":"Falcao","10620":"Anders Lindegaard","4564":"Robin van Persie","25363":"Juan Mata","71174":"Ander Herrera","79554":"David de Gea","2115":"Michael Carrick","3859":"Wayne Rooney","8166":"Ashley Young","81726":"Phil Jones","118244":"Luke Shaw","137795":"Tyler Blackett","145271":"James Wilson","71345":"Chris Smalling","5835":"Darren Fletcher","22079":"Jonny Evans"}
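Two details of this snippet are worth checking before relying on data_dict. The Python 3 sketch below uses a made-up nested sample string and the same pattern as above. It illustrates that the lazy .*? ends at the first closing brace, and that passing a string through json.dumps and then json.loads gives back the same string rather than a dict.

```python
import json
import re

# Hypothetical nested sample; the extra "x" key stands in for the rest of the object.
text = 'var matchCentreData = {"playerIdNameDictionary":{"34693":"Marko Arnautovic"},"x":1};'

data_cen = re.compile('var matchCentreData = ({.*?})')
lazy = data_cen.search(text).group(1)  # ends at the first "}" (the inner dict)

# dumps turns the str into a JSON string literal; loads turns it back into the same str.
round_trip = json.loads(json.dumps(lazy))

# Decoding the lazy match directly fails because its braces are unbalanced.
try:
    json.loads(lazy)
    lazy_is_valid_json = True
except ValueError:
    lazy_is_valid_json = False
```

With this sample, lazy holds the text up to the end of the inner dictionary only, which matches the shape of the output shown above.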
You can also use find_next with similar regular expressions to locate the script and extract the data you need:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content)
data_cen = re.compile('var matchCentreData = ({.*?})')
event_type = re.compile('var matchCentreEventTypeJson = ({.*?})')
data = soup.find("a", href="/ContactUs").find_next("script").text
d = json.dumps(data_cen.search(data).group(1))
e = json.dumps(event_type.search(data).group(1))
data_dict = json.loads(d)
event_dict = json.loads(e)
{"playerIdNameDictionary":{"34693":"Marko Arnautovic","23122":"Asmir Begovic","39935":"Steven N'Zonzi","4145":"Robert Huth","3860":"Jonathan Walters","23446":"Marc Wilson","8505":"Glenn Whelan","29762":"Oussama Assaidi","24148":"Erik Pieters","26013":"Mame Biram Diouf","75177":"Marc Muniesa","38772":"Geoff Cameron","107395":"Jack Butland","29798":"Ryan Shawcross","3807":"Peter Crouch","8327":"Charlie Adam","18181":"Phil Bardsley","254558":"Oliver Shenton","130334":"Adnan Januzaj","4092":"Rafael","18701":"Falcao","10620":"Anders Lindegaard","4564":"Robin van Persie","25363":"Juan Mata","71174":"Ander Herrera","79554":"David de Gea","2115":"Michael Carrick","3859":"Wayne Rooney","8166":"Ashley Young","81726":"Phil Jones","118244":"Luke Shaw","137795":"Tyler Blackett","145271":"James Wilson","71345":"Chris Smalling","5835":"Darren Fletcher","22079":"Jonny Evans"}
{"shotSixYardBox":0,"shotPenaltyArea":1,"shotOboxTotal":2,"shotOpenPlay":3,"shotCounter":4,"shotSetPiece":5,"shotOffTarget":6,"shotOnPost":7,"shotOnTarget":8,"shotsTotal":9,"shotBlocked":10,"shotRightFoot":11,"shotLeftFoot":12,"shotHead":13,"shotObp":14,"goalSixYardBox":15,"goalPenaltyArea":16,"goalObox":17,"goalOpenPlay":18,"goalCounter":19,"goalSetPiece":20,"penaltyScored":21,"goalOwn":22,"goalNormal":23,"goalRightFoot":24,"goalLeftFoot":25,"goalHead":26,"goalObp":27,"shortPassInaccurate":28,"shortPassAccurate":29,"passCorner":30,"passCornerAccurate":31,"passCornerInaccurate":32,"passFreekick":33,"passBack":34,"passForward":35,"passLeft":36,"passRight":37,"keyPassLong":38,"keyPassShort":39,"keyPassCross":40,"keyPassCorner":41,"keyPassThroughball":42,"keyPassFreekick":43,"keyPassThrowin":44,"keyPassOther":45,"assistCross":46,"assistCorner":47,"assistThroughball":48,"assistFreekick":49,"assistThrowin":50,"assistOther":51,"dribbleLost":52,"dribbleWon":53,"challengeLost":54,"interceptionWon":55,"clearanceHead":56,"outfielderBlock":57,"passCrossBlockedDefensive":58,"outfielderBlockedPass":59,"offsideGiven":60,"offsideProvoked":61,"foulGiven":62,"foulCommitted":63,"yellowCard":64,"voidYellowCard":65,"secondYellow":66,"redCard":67,"turnover":68,"dispossessed":69,"saveLowLeft":70,"saveHighLeft":71,"saveLowCentre":72,"saveHighCentre":73,"saveLowRight":74,"saveHighRight":75,"saveHands":76,"saveFeet":77,"saveObp":78,"saveSixYardBox":79,"savePenaltyArea":80,"saveObox":81,"keeperDivingSave":82,"standingSave":83,"closeMissHigh":84,"closeMissHighLeft":85,"closeMissHighRight":86,"closeMissLeft":87,"closeMissRight":88,"shotOffTargetInsideBox":89,"touches":90,"assist":91,"ballRecovery":92,"clearanceEffective":93,"clearanceTotal":94,"clearanceOffTheLine":95,"dribbleLastman":96,"errorLeadsToGoal":97,"errorLeadsToShot":98,"intentionalAssist":99,"interceptionAll":100,"interceptionIntheBox":101,"keeperClaimHighLost":102,"keeperClaimHighWon":103,"keeperClaimLost":104,"keeperClaimWon":105
,"keeperOneToOneWon":106,"parriedDanger":107,"parriedSafe":108,"collected":109,"keeperPenaltySaved":110,"keeperSaveInTheBox":111,"keeperSaveTotal":112,"keeperSmother":113,"keeperSweeperLost":114,"keeperMissed":115,"passAccurate":116,"passBackZoneInaccurate":117,"passForwardZoneAccurate":118,"passInaccurate":119,"passAccuracy":120,"cornerAwarded":121,"passKey":122,"passChipped":123,"passCrossAccurate":124,"passCrossInaccurate":125,"passLongBallAccurate":126,"passLongBallInaccurate":127,"passThroughBallAccurate":128,"passThroughBallInaccurate":129,"passThroughBallInacurate":130,"passFreekickAccurate":131,"passFreekickInaccurate":132,"penaltyConceded":133,"penaltyMissed":134,"penaltyWon":135,"passRightFoot":136,"passLeftFoot":137,"passHead":138,"sixYardBlock":139,"tackleLastMan":140,"tackleLost":141,"tackleWon":142,"cleanSheetGK":143,"cleanSheetDL":144,"cleanSheetDC":145,"cleanSheetDR":146,"cleanSheetDML":147,"cleanSheetDMC":148,"cleanSheetDMR":149,"cleanSheetML":150,"cleanSheetMC":151,"cleanSheetMR":152,"cleanSheetAML":153,"cleanSheetAMC":154,"cleanSheetAMR":155,"cleanSheetFWL":156,"cleanSheetFW":157,"cleanSheetFWR":158,"cleanSheetSub":159,"goalConcededByTeamGK":160,"goalConcededByTeamDL":161,"goalConcededByTeamDC":162,"goalConcededByTeamDR":163,"goalConcededByTeamDML":164,"goalConcededByTeamDMC":165,"goalConcededByTeamDMR":166,"goalConcededByTeamML":167,"goalConcededByTeamMC":168,"goalConcededByTeamMR":169,"goalConcededByTeamAML":170,"goalConcededByTeamAMC":171,"goalConcededByTeamAMR":172,"goalConcededByTeamFWL":173,"goalConcededByTeamFW":174,"goalConcededByTeamFWR":175,"goalConcededByTeamSub":176,"goalConcededOutsideBoxGoalkeeper":177,"goalScoredByTeamGK":178,"goalScoredByTeamDL":179,"goalScoredByTeamDC":180,"goalScoredByTeamDR":181,"goalScoredByTeamDML":182,"goalScoredByTeamDMC":183,"goalScoredByTeamDMR":184,"goalScoredByTeamML":185,"goalScoredByTeamMC":186,"goalScoredByTeamMR":187,"goalScoredByTeamAML":188,"goalScoredByTeamAMC":189,"goalScoredByTeamAMR":190,"goalS
coredByTeamFWL":191,"goalScoredByTeamFW":192,"goalScoredByTeamFWR":193,"goalScoredByTeamSub":194,"aerialSuccess":195,"duelAerialWon":196,"duelAerialLost":197,"offensiveDuel":198,"defensiveDuel":199,"bigChanceMissed":200,"bigChanceScored":201,"bigChanceCreated":202,"overrun":203,"successfulFinalThirdPasses":204,"punches":205,"penaltyShootoutScored":206,"penaltyShootoutMissedOffTarget":207,"penaltyShootoutSaved":208,"penaltyShootoutSavedGK":209,"penaltyShootoutConcededGK":210,"throwIn":211,"subOn":212,"subOff":213,"defensiveThird":214,"midThird":215,"finalThird":216,"pos":217}
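If BeautifulSoup is not available, the step of locating the right script element can be sketched with the standard library alone. The following Python 3 example is an assumption-laden illustration rather than a drop-in replacement: ScriptCollector, find_script_containing, and the sample HTML are hypothetical names and data, and html.parser.HTMLParser is used in place of soup.find.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def find_script_containing(html, marker):
    """Return the first script body that contains marker, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for script in collector.scripts:
        if marker in script:
            return script
    return None


# Hypothetical stand-in for the downloaded page.
sample_html = (
    '<html><body><a href="/ContactUs">Contact</a>'
    '<script>var unrelated = 1;</script>'
    '<script>var matchCentreData = {"playerIdNameDictionary":{"34693":"Marko Arnautovic"}};</script>'
    '</body></html>'
)
script_text = find_script_containing(sample_html, "var matchCentreData = ")
```

The returned script text can then be searched for the variable assignments in the same way as the data string in the snippet above.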
Full code:
import json
import re
import requests
from bs4 import BeautifulSoup
url = 'http://www.whoscored.com/Matches/829726/Live/England-Premier-League-2014-2015-Stoke-Manchester-United'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
'Host': 'www.whoscored.com',
'Referer': 'http://www.whoscored.com/'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content)
data_cen = re.compile('var matchCentreData = ({.*?})')
event_type = re.compile('var matchCentreEventTypeJson = ({.*?})')
data = soup.find("a", href="/ContactUs").find_next("script").text
d = json.dumps(data_cen.search(data).group(1))
e = json.dumps(event_type.search(data).group(1))
data_dict = json.loads(d)
event_dict = json.loads(e)
print(event_dict)
print(data_dict)
{"shotSixYardBox":0,"shotPenaltyArea":1,"shotOboxTotal":2,"shotOpenPlay":3,"shotCounter":4,"shotSetPiece":5,"shotOffTarget":6,"shotOnPost":7,"shotOnTarget":8,"shotsTotal":9,"shotBlocked":10,"shotRightFoot":11,"shotLeftFoot":12,"shotHead":13,"shotObp":14,"goalSixYardBox":15,"goalPenaltyArea":16,"goalObox":17,"goalOpenPlay":18,"goalCounter":19,"goalSetPiece":20,"penaltyScored":21,"goalOwn":22,"goalNormal":23,"goalRightFoot":24,"goalLeftFoot":25,"goalHead":26,"goalObp":27,"shortPassInaccurate":28,"shortPassAccurate":29,"passCorner":30,"passCornerAccurate":31,"passCornerInaccurate":32,"passFreekick":33,"passBack":34,"passForward":35,"passLeft":36,"passRight":37,"keyPassLong":38,"keyPassShort":39,"keyPassCross":40,"keyPassCorner":41,"keyPassThroughball":42,"keyPassFreekick":43,"keyPassThrowin":44,"keyPassOther":45,"assistCross":46,"assistCorner":47,"assistThroughball":48,"assistFreekick":49,"assistThrowin":50,"assistOther":51,"dribbleLost":52,"dribbleWon":53,"challengeLost":54,"interceptionWon":55,"clearanceHead":56,"outfielderBlock":57,"passCrossBlockedDefensive":58,"outfielderBlockedPass":59,"offsideGiven":60,"offsideProvoked":61,"foulGiven":62,"foulCommitted":63,"yellowCard":64,"voidYellowCard":65,"secondYellow":66,"redCard":67,"turnover":68,"dispossessed":69,"saveLowLeft":70,"saveHighLeft":71,"saveLowCentre":72,"saveHighCentre":73,"saveLowRight":74,"saveHighRight":75,"saveHands":76,"saveFeet":77,"saveObp":78,"saveSixYardBox":79,"savePenaltyArea":80,"saveObox":81,"keeperDivingSave":82,"standingSave":83,"closeMissHigh":84,"closeMissHighLeft":85,"closeMissHighRight":86,"closeMissLeft":87,"closeMissRight":88,"shotOffTargetInsideBox":89,"touches":90,"assist":91,"ballRecovery":92,"clearanceEffective":93,"clearanceTotal":94,"clearanceOffTheLine":95,"dribbleLastman":96,"errorLeadsToGoal":97,"errorLeadsToShot":98,"intentionalAssist":99,"interceptionAll":100,"interceptionIntheBox":101,"keeperClaimHighLost":102,"keeperClaimHighWon":103,"keeperClaimLost":104,"keeperClaimWon":105
,"keeperOneToOneWon":106,"parriedDanger":107,"parriedSafe":108,"collected":109,"keeperPenaltySaved":110,"keeperSaveInTheBox":111,"keeperSaveTotal":112,"keeperSmother":113,"keeperSweeperLost":114,"keeperMissed":115,"passAccurate":116,"passBackZoneInaccurate":117,"passForwardZoneAccurate":118,"passInaccurate":119,"passAccuracy":120,"cornerAwarded":121,"passKey":122,"passChipped":123,"passCrossAccurate":124,"passCrossInaccurate":125,"passLongBallAccurate":126,"passLongBallInaccurate":127,"passThroughBallAccurate":128,"passThroughBallInaccurate":129,"passThroughBallInacurate":130,"passFreekickAccurate":131,"passFreekickInaccurate":132,"penaltyConceded":133,"penaltyMissed":134,"penaltyWon":135,"passRightFoot":136,"passLeftFoot":137,"passHead":138,"sixYardBlock":139,"tackleLastMan":140,"tackleLost":141,"tackleWon":142,"cleanSheetGK":143,"cleanSheetDL":144,"cleanSheetDC":145,"cleanSheetDR":146,"cleanSheetDML":147,"cleanSheetDMC":148,"cleanSheetDMR":149,"cleanSheetML":150,"cleanSheetMC":151,"cleanSheetMR":152,"cleanSheetAML":153,"cleanSheetAMC":154,"cleanSheetAMR":155,"cleanSheetFWL":156,"cleanSheetFW":157,"cleanSheetFWR":158,"cleanSheetSub":159,"goalConcededByTeamGK":160,"goalConcededByTeamDL":161,"goalConcededByTeamDC":162,"goalConcededByTeamDR":163,"goalConcededByTeamDML":164,"goalConcededByTeamDMC":165,"goalConcededByTeamDMR":166,"goalConcededByTeamML":167,"goalConcededByTeamMC":168,"goalConcededByTeamMR":169,"goalConcededByTeamAML":170,"goalConcededByTeamAMC":171,"goalConcededByTeamAMR":172,"goalConcededByTeamFWL":173,"goalConcededByTeamFW":174,"goalConcededByTeamFWR":175,"goalConcededByTeamSub":176,"goalConcededOutsideBoxGoalkeeper":177,"goalScoredByTeamGK":178,"goalScoredByTeamDL":179,"goalScoredByTeamDC":180,"goalScoredByTeamDR":181,"goalScoredByTeamDML":182,"goalScoredByTeamDMC":183,"goalScoredByTeamDMR":184,"goalScoredByTeamML":185,"goalScoredByTeamMC":186,"goalScoredByTeamMR":187,"goalScoredByTeamAML":188,"goalScoredByTeamAMC":189,"goalScoredByTeamAMR":190,"goalS
coredByTeamFWL":191,"goalScoredByTeamFW":192,"goalScoredByTeamFWR":193,"goalScoredByTeamSub":194,"aerialSuccess":195,"duelAerialWon":196,"duelAerialLost":197,"offensiveDuel":198,"defensiveDuel":199,"bigChanceMissed":200,"bigChanceScored":201,"bigChanceCreated":202,"overrun":203,"successfulFinalThirdPasses":204,"punches":205,"penaltyShootoutScored":206,"penaltyShootoutMissedOffTarget":207,"penaltyShootoutSaved":208,"penaltyShootoutSavedGK":209,"penaltyShootoutConcededGK":210,"throwIn":211,"subOn":212,"subOff":213,"defensiveThird":214,"midThird":215,"finalThird":216,"pos":217}
{"playerIdNameDictionary":{"34693":"Marko Arnautovic","23122":"Asmir Begovic","39935":"Steven N'Zonzi","4145":"Robert Huth","3860":"Jonathan Walters","23446":"Marc Wilson","8505":"Glenn Whelan","29762":"Oussama Assaidi","24148":"Erik Pieters","26013":"Mame Biram Diouf","75177":"Marc Muniesa","38772":"Geoff Cameron","107395":"Jack Butland","29798":"Ryan Shawcross","3807":"Peter Crouch","8327":"Charlie Adam","18181":"Phil Bardsley","254558":"Oliver Shenton","130334":"Adnan Januzaj","4092":"Rafael","18701":"Falcao","10620":"Anders Lindegaard","4564":"Robin van Persie","25363":"Juan Mata","71174":"Ander Herrera","79554":"David de Gea","2115":"Michael Carrick","3859":"Wayne Rooney","8166":"Ashley Young","81726":"Phil Jones","118244":"Luke Shaw","137795":"Tyler Blackett","145271":"James Wilson","71345":"Chris Smalling","5835":"Darren Fletcher","22079":"Jonny Evans"}