我正在研究一个python脚本,该脚本需要一个原始HTML地址,并查找以JSON格式格式化的内容,而忽略了所有HMTL和CSS。下面的示例:
<!DOCTYPE html>
<html>
<body>\
<h1>Example Header</h1>
<p>Example Paragraph.</p>
</body>
</html>
<script id='data-grnhs' type='text/template'>
{"cities":
[{"name":"New York City","numPeople":47},
{"name":"Dublin","numPeople":20},
{"name":"Palo Alto","numPeople":11},
{"name":"Austin","numPeople":10},
{"name":"London","numPeople":5},
{"name":"New Delhi/Gurgaon","numPeople":5},
{"name":"Toronto","numPeople":4},
{"name":"Sydney","numPeople":3},
{"name":"Seattle","numPeople":3}]
}
由此,我想严格地从此内容中获取JSON,因为它不一定总是在相同的位置/以相同的语法列出。
我尝试使用BeautifulSoup来清除不同的HTML标记,例如:
# Clear every script tag
for tag in soup.find_all('script'):
tag.clear()
# Clear every style tag
for tag in soup.find_all('style'):
tag.clear()
# Remove style attributes (if needed)
for tag in soup.find_all(style=True):
del tag['style']
但是它仍然可以为我提供所需的格式化输出。有没有一种方法可以从HTML文件严格转换JSON内容?