Question

我正在研究一个python脚本，该脚本需要一个原始HTML地址，并查找以JSON格式格式化的内容，而忽略了所有HMTL和CSS。下面的示例：

<!DOCTYPE html>
<html>
<body>\
<h1>Example Header</h1>
<p>Example Paragraph.</p>
</body>
</html>

<script id='data-grnhs' type='text/template'>
    {"cities":
        [{"name":"New York City","numPeople":47}, 
         {"name":"Dublin","numPeople":20},
         {"name":"Palo Alto","numPeople":11}, 
         {"name":"Austin","numPeople":10}, 
         {"name":"London","numPeople":5},
         {"name":"New Delhi/Gurgaon","numPeople":5}, 
         {"name":"Toronto","numPeople":4}, 
         {"name":"Sydney","numPeople":3}, 
         {"name":"Seattle","numPeople":3}]
     }

由此，我想严格地从此内容中获取JSON，因为它不一定总是在相同的位置/以相同的语法列出。

我尝试使用BeautifulSoup来清除不同的HTML标记，例如：

# Clear every script tag
for tag in soup.find_all('script'):
    tag.clear()

# Clear every style tag
for tag in soup.find_all('style'):
    tag.clear()

# Remove style attributes (if needed)
for tag in soup.find_all(style=True):
    del tag['style']

但是它仍然可以为我提供所需的格式化输出。有没有一种方法可以从HTML文件严格转换JSON内容？

如何仅获取HTML请求的JSON部分

0 个答案: