Question

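I am trying to get the JSON data that I can see when I inspect the page source of a particular URL. The page has several <script> tags, but only one of them contains data in JSON format.

This is my current implementation: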

import urllib2 
from bs4 import BeautifulSoup
import re
import json

url = "https://www.exampleURL.com"

page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
scripts = soup.find_all('script')

for script in scripts:
    try:
        data = json.loads(script)
        print("Success")
    except Exception:
        print("Not Successful")

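This implementation never prints "Success". The JSON data I want has the following format. Only one script tag contains the JSON data; all the other script tags are irrelevant to me.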

<script>
    __DATA__ = {........};
</script>

Answer 1

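Before trying to parse the contents of the <script> tag as JSON, you need to do some data processing. In particular, you need to remove the __DATA__ = part in front of the JavaScript dictionary, along with the trailing semicolon.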
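
As a rough sketch of that processing step, the snippet below strips the __DATA__ = prefix and the trailing semicolon with a regular expression and hands the rest to json.loads. The helper name parse_data_script, the DATA_PATTERN regex, and the sample string are assumptions made for illustration; they are not from the original post, and the sketch only works when the object literal happens to be valid JSON.

```python
import json
import re

# Hypothetical pattern: matches "__DATA__ = {...}" with an optional trailing ";".
# DOTALL lets the object literal span several lines.
DATA_PATTERN = re.compile(r"__DATA__\s*=\s*(\{.*\})\s*;?\s*$", re.DOTALL)


def parse_data_script(script_text):
    """Return the parsed object from one script's text, or None if it does not apply."""
    if not script_text:
        # Empty scripts (for example those that only have a src attribute).
        return None
    match = DATA_PATTERN.search(script_text.strip())
    if match is None:
        # This script does not assign to __DATA__.
        return None
    try:
        return json.loads(match.group(1))
    except ValueError:
        # The literal is JavaScript but not valid JSON.
        return None


# Made-up sample standing in for the text of the interesting <script> tag.
sample = '\n    __DATA__ = {"user": {"id": 7}, "tags": ["a", "b"]};\n'
print(parse_data_script(sample))
print(parse_data_script("console.log('hi');"))
```

In the loop from the question, this would be called as parse_data_script(script.string) for each script, keeping the first result that is not None.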

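Keep the following in mind:

A JavaScript dictionary (object literal) is not necessarily valid JSON. In particular, keys may be unquoted or single-quoted, which JSON does not allow.

Example: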

{hello: 2}   # Correct JavaScript, incorrect JSON - missing quotes around key
{'hello': 2} # Correct JavaScript, incorrect JSON - Quotes must be double quotes

{"hello": 2} # Correct JSON and JavaScript
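
To check this from Python, the three literals above can be passed to json.loads one by one. This is only a small illustration; the helper name is_valid_json is made up for this example.

```python
import json

# The three literals from the example above, as Python strings.
candidates = ['{hello: 2}', "{'hello': 2}", '{"hello": 2}']


def is_valid_json(text):
    """Return True if json.loads accepts the text, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


results = {text: is_valid_json(text) for text in candidates}
for text, ok in results.items():
    print(text, "->", "valid JSON" if ok else "not valid JSON")
```

Only the last literal, with a double-quoted key, is accepted.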

A few things that may help with debugging:

for script in scripts:
    try:
        print(script.string)  # See what you are trying to load
        data = json.loads(script.string)  # Pass the tag's text, not the Tag object
        print("Success")
    except Exception as e:
        print("Not Successful because {}".format(e))  # Print additional information
