I have a JSON file similar to this:
{
"entryLabel": "cat",
"entryContent": "<div class=\"entry_container\"><div class=\"entry lang_en-gb\" id=\"cat_1\"><span class=\"inline\"><h1 class=\"hwd\">cat<\/h1><span> [<\/span><span class=\"pron\" type=\"\">ˈkæt<a href=\"#\" class=\"playback\"><img src=\"https://api.collinsdictionary.com/external/images/redspeaker.gif?version=2013-10-30-1535\" alt=\"Pronunciation for cat\" class=\"sound\" title=\"Pronunciation for cat\" style=\"cursor: pointer\"/><\/a><audio type=\"pronunciation\" title=\"cat\"><source type=\"audio/mpeg\" src=\"https://api.collinsdictionary.com/media/sounds/sounds/0/081/08189/08189.mp3\"/>Your browser does not support HTML5 audio.<\/audio><\/span><span>]<\/span><\/span><div class=\"hom\" id=\"cat_1.1\"><span> <\/span><span class=\"gramGrp\"><span class=\"pos\">noun<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"bold\">1 <\/span><span class=\"lbl\"><span>(<\/span>domestic<span>)<\/span><\/span><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">chat <em class=\"hi\">m<\/em><\/span><\/span><span class=\"cit\" id=\"cat_1.2\"><span>; <\/span><span class=\"quote\">Have you got a cat?<\/span><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">Est-ce que tu as un chat?<\/span><\/span><\/span><span class=\"re\" id=\"cat_1.3\"><span>; <\/span><span class=\"inline\"><span class=\"orth\">to let the cat out of the bag<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">vendre la mèche<\/span><\/span><\/div><!-- End of DIV sense--><\/span><span class=\"re\" id=\"cat_1.4\"><span>; <\/span><span class=\"inline\"><span class=\"orth\">curiosity killed the cat<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">la curiosité est toujours punie<\/span><\/span><\/div><!-- End of DIV sense--><\/span><span class=\"re\" id=\"cat_1.5\"><span>; <\/span><span class=\"inline\"><span class=\"orth\">to look like sth the cat dragged in<\/span><\/span><span 
class=\"inline\"><span>, <\/span><span class=\"orth\">to look like sth the cat brought in<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">être dans un état lamentable<\/span><\/span><\/div><!-- End of DIV sense--><\/span><span class=\"re\" id=\"cat_1.6\"><span>; <\/span><span class=\"inline\"><span class=\"orth\">to play cat and mouse with sb<\/span><\/span><span class=\"inline\"><span>, <\/span><span class=\"orth\">to play a game of cat and mouse with sb<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">jouer au chat et à la souris avec qn<\/span><\/span><\/div><!-- End of DIV sense--><\/span><span class=\"re\" id=\"cat_1.7\"><span>; <\/span><span class=\"inline\"><span class=\"orth\">to put the cat among the pigeons<\/span><\/span><span class=\"inline\"><span>, <\/span><span class=\"orth\">to set the cat among the pigeons<\/span><\/span><span class=\"lbl\"><span> (<\/span>British<span>)<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">jeter un pavé dans la mare<\/span><\/span><\/div><!-- End of DIV sense--><\/span><span class=\"re\" id=\"cat_1.8\"><span>; <\/span><span class=\"inline\"><span class=\"orth\">there's no room to swing a cat<\/span><\/span><div class=\"sense\"><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">on ne peut pas se tourner<\/span><\/span><\/div><!-- End of DIV sense--><\/span><\/div><!-- End of DIV sense--><div class=\"sense\"><span> <br/><\/span><span class=\"bold\">2 <\/span><span class=\"lbl\"><span>(= <\/span>big cat<span>)<\/span><\/span><span> <\/span><span class=\"cit lang_fr\"><span class=\"quote\">félin <em class=\"hi\">m<\/em><\/span><\/span><span class=\"cit\" id=\"cat_1.9\"><span>; <\/span><\/span><\/div><!-- End of DIV sense--><\/div><!-- End of DIV hom--><\/div><!-- End of DIV entry lang_en-gb--><\/div><!-- End of DIV entry_container-->\n"
}
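I need to parse this JSON file, but the value of "entryContent" is an HTML string. Should I change the structure of my initial JSON file, or can I parse the HTML string directly? I would appreciate some advice.
For now I only have this code: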
import json
from pprint import pprint

json_data = open('cat.json')
data = json.load(json_data)
#pprint(data)
print(data["dictionaryCode"])
print(data["entryLabel"])
print(data["entryContent"])
json_data.close()
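Finally, from the HTML I need to get:
the value of this span: <span class="pron" type="">ˈkæt</span>;
the src value of the source element <source type="audio/mpeg" src="https://api.collinsdictionary.com/media/sounds/sounds/0/081/08189/08189.mp3"/>;
the value of the span with class pos: <span class="gramGrp"><span class="pos">noun</span></span>;
and all the senses given by the div elements, for example: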
<div class="sense">
<span> <br/></span>
<span class="bold">2 </span><span class="lbl"><span>(= </span>big cat<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">félin <em class="hi">m</em></span></span><span class="cit" id="cat_1.9"><span>; </span></span>
</div>
Answer 0 (score: 2)
Try using BeautifulSoup:
import json
from bs4 import BeautifulSoup

# Use json.load inside a 'with' block so the file is closed automatically
# (no explicit close() call is needed afterwards).
with open('cat.json') as f:
    data = json.load(f)

print(data["dictionaryCode"])
print(data["entryLabel"])

# Parse the HTML string stored under "entryContent";
# naming the parser explicitly avoids a BeautifulSoup warning.
entryContentHTML = BeautifulSoup(data["entryContent"], "html.parser")
print(entryContentHTML.prettify())
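Once the HTML is parsed, the four pieces asked for in the question can be pulled out with `find` and `find_all`. The following is only a sketch, not a definitive implementation: the function name `extract_entry`, the keys of the returned dictionary, and the shortened `html` string (a stand-in for `data["entryContent"]`) are assumptions made for illustration. It also assumes that only the senses sitting directly under the `div` with class `hom` are wanted, not the ones nested inside idioms.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for data["entryContent"]; the real string is much longer.
html = (
    '<div class="entry lang_en-gb" id="cat_1">'
    '<span class="pron" type="">ˈkæt'
    '<audio type="pronunciation" title="cat">'
    '<source type="audio/mpeg" src="https://api.collinsdictionary.com/'
    'media/sounds/sounds/0/081/08189/08189.mp3"/>'
    'Your browser does not support HTML5 audio.</audio></span>'
    '<div class="hom" id="cat_1.1"><span> </span>'
    '<span class="gramGrp"><span class="pos">noun</span></span>'
    '<div class="sense"><span> </span><span class="bold">1 </span>'
    '<span class="lbl"><span>(</span>domestic<span>)</span></span><span> </span>'
    '<span class="cit lang_fr"><span class="quote">chat '
    '<em class="hi">m</em></span></span></div>'
    '<div class="sense"><span> </span><span class="bold">2 </span>'
    '<span class="lbl"><span>(= </span>big cat<span>)</span></span><span> </span>'
    '<span class="cit lang_fr"><span class="quote">félin '
    '<em class="hi">m</em></span></span></div>'
    '</div></div>'
)


def extract_entry(html):
    """Return pronunciation, audio URL, part of speech and senses (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # The pron span also wraps the <audio> fallback text,
    # so keep only the first non-empty string inside it.
    pron = next(soup.find("span", class_="pron").stripped_strings, None)

    # Attribute values are read with dictionary-style access.
    audio_src = soup.find("source")["src"]

    pos = soup.find("span", class_="pos").get_text(strip=True)

    # recursive=False keeps only the senses that are direct children of the
    # hom div and skips sense divs nested deeper (assumption: that is what is wanted).
    hom = soup.find("div", class_="hom")
    senses = [
        " ".join(sense.get_text().split())  # collapse runs of whitespace
        for sense in hom.find_all("div", class_="sense", recursive=False)
    ]

    return {"pron": pron, "audio_src": audio_src, "pos": pos, "senses": senses}


entry = extract_entry(html)
print(entry["pron"])
print(entry["audio_src"])
print(entry["pos"])
for sense in entry["senses"]:
    print(sense)
```

With the real file you would call `extract_entry(data["entryContent"])` instead. If you need the nested senses as well, drop `recursive=False`; if you need the markup of each sense rather than its text, keep the `sense` tags themselves instead of calling `get_text()`.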