I can't scrape data from understat into JSON

Time: 2019-07-10 15:59:39

Tags: python html json web-scraping scripting

I am trying to get information from https://understat.com/league/EPL.

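I have read up on what other people have done, but I just can't put the last piece of the puzzle together. I managed to decode the data, but I can't get it as a jsonObject. Does anyone have an idea?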

import requests
import json
import pandas as pd
import time
import lxml.html as lh
import codecs
from bs4 import BeautifulSoup

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = "https://understat.com/league/EPL"
page = requests.get(url)
soup = BeautifulSoup(page.content,'html.parser')

scripts = soup.find_all('script')

for script in scripts:
    if 'var' in script.text:
        encoded_string = script.text
        encoded_string = encoded_string.split("JSON.parse('", 1)
        encoded_string = encoded_string.rsplit("'),",1)[0]

        jsonStr = codecs.getdecoder('unicode-escape')(encoded_string)[0]
        jsonObj = json.loads(jsonStr)
        print(jsonObj)

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 4 (char 4)
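One possible reading of this error position, offered as a sketch rather than a diagnosis: `json.loads` skips leading whitespace and then fails at the first character that cannot start a JSON value. The string below is a hypothetical stand-in for a script body that still begins with a newline, indentation and `var`, meaning the JSON payload was never cut out of it. Under that assumption it fails at the same line, column and character offset as the message above.

```python
import json

# Hypothetical stand-in for a <script> body that still contains the
# JavaScript around the payload: a newline, three tabs, then "var".
raw = "\n\t\t\tvar data = JSON.parse('...');"

try:
    json.loads(raw)
except json.JSONDecodeError as err:
    # The decoder skipped the whitespace and stopped at the "v" of "var".
    message = str(err)
    print(message)  # Expecting value: line 2 column 4 (char 4)
```

If the real string looks like this, the fix lies in the slicing step before `json.loads`, not in the decoding step.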

Here is some of the data from the jsonString:

{"id":"9197","isResult":true,"h":{"id":"89","title":"Manchester United","short_title":"MUN"},"a":{"id":"75","title":"Leicester","short_title":"LEI"},"goals":{"h":"2","a":"1"},"xG":{"h":"1.5137","a":"1.73813"},"datetime":"2018-08-10 22:00:00","forecast":{"w":"0.2812","d":"0.3275","l":"0.3913"}},{"id":"9198","isResult":true,"h":{"id":"86","title":"Newcastle United","short_title":"NEW"},"a":{"id":"82","title":"Tottenham","short_title":"TOT"},"goals":{"h":"1","a":"2"},"xG":{"h":"0.974497","a":"2.58097"},"datetime":"2018-08-11 14:30:00","forecast":{"w":"0.08","d":"0.1479","l":"0.7721"}},{"id":"9199","isResult":true,"h":{"id":"90","title":"Watford","short_title":"WAT"},"a":{"id":"220","title":"Brighton","short_title":"BRI"},"goals":{"h":"2","a":"0"},"xG":{"h":"1.42372","a":"0.45504"},"datetime":"2018-08-11 17:00:00","forecast":{"w":"0.6438","d":"0.2574","l":"0.0988"}},
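For reference, records of this shape are already valid JSON objects once the fragment is wrapped in square brackets and the trailing comma is dropped. The sketch below copies the first two records from the sample and flattens the nested `h`, `a`, `goals`, `xG` and `forecast` objects into dotted keys. The `flatten` helper is a hypothetical name introduced for illustration only.

```python
import json

# First two records from the sample, wrapped in [ ] to form a JSON array.
sample = '''[
{"id":"9197","isResult":true,"h":{"id":"89","title":"Manchester United","short_title":"MUN"},"a":{"id":"75","title":"Leicester","short_title":"LEI"},"goals":{"h":"2","a":"1"},"xG":{"h":"1.5137","a":"1.73813"},"datetime":"2018-08-10 22:00:00","forecast":{"w":"0.2812","d":"0.3275","l":"0.3913"}},
{"id":"9198","isResult":true,"h":{"id":"86","title":"Newcastle United","short_title":"NEW"},"a":{"id":"82","title":"Tottenham","short_title":"TOT"},"goals":{"h":"1","a":"2"},"xG":{"h":"0.974497","a":"2.58097"},"datetime":"2018-08-11 14:30:00","forecast":{"w":"0.08","d":"0.1479","l":"0.7721"}}
]'''


def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict with dotted keys (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


rows = [flatten(match) for match in json.loads(sample)]
for row in rows:
    print(row["h.title"], row["goals.h"], "-", row["goals.a"], row["a.title"])
```

Note that the scores and xG values are strings in the source data, so they would need converting with `int` or `float` before any arithmetic. With pandas installed, `pd.json_normalize(json.loads(sample))` should produce a comparable table with dotted column names.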

1 answer:

Answer 0: (score: 0)

Try the following, which uses a different regular expression and takes a substring of the decoded result:

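import codecs
import json
import re

import requests

# Download the page; the data sits inside inline <script> tags.
r = requests.get('https://understat.com/league/EPL')
# Capture everything between "JSON.parse(" and ");", quotes included.
p = re.compile(r'JSON\.parse\((.*)\);')
# Take the first match.
d = p.findall(r.text)[0]
# Turn the escape sequences into real characters.
json_str = codecs.getdecoder('unicode-escape')(d)[0]
# Drop the leading and trailing quote, then parse.
data = json.loads(json_str[1:-1])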

Example output of print(data): (image not included)
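The same steps can be tried without a network request. The sketch below is an illustration under stated assumptions, not the page's actual markup: the `html` fragment is made up, the variable names `datesData` and `teamsData` are assumed, and the payload is assumed to be hex-escaped (`\x22` for a double quote and so on). The pattern here captures only the text between the single quotes, so no `[1:-1]` slice is needed, and it collects every variable instead of only the first match.

```python
import codecs
import json
import re

# Made-up page fragment; the variable names and escaping style are assumptions.
html = r"""
<script>
    var datesData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x229197\x22\x7D\x5D');
</script>
<script>
    var teamsData = JSON.parse('\x7B\x2289\x22\x3A\x7B\x22title\x22\x3A\x22Manchester United\x22\x7D\x7D');
</script>
"""

# Group 1: the JavaScript variable name. Group 2: the escaped payload
# between the single quotes of JSON.parse('...').
PATTERN = re.compile(r"var\s+(\w+)\s*=\s*JSON\.parse\('(.*?)'\)")


def extract_json_vars(page_text):
    """Return {variable name: parsed JSON} for each JSON.parse('...') call found."""
    result = {}
    for name, payload in PATTERN.findall(page_text):
        # Same decoding call as in the answer: escape sequences become characters.
        decoded = codecs.getdecoder("unicode-escape")(payload)[0]
        result[name] = json.loads(decoded)
    return result


data = extract_json_vars(html)
print(data["datesData"])
print(data["teamsData"]["89"]["title"])
```

Against the live page one would pass `r.text` from the answer's code instead of `html`; whether the real markup matches this pattern has to be checked against the actual response.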