Conversion needed for HTML code that is partly in JSON format

Time: 2013-08-08 07:57:02

Tags: python html json beautifulsoup

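I have some HTML code that contains a JSON string generated by another program, and the whole JSON string is commented out in the HTML. Some important information has to be parsed out of that JSON. What can I do to convert the commented JSON string into HTML, so that it becomes proper HTML code I can parse?

Here is a sample of the input. Because of the character limit, I removed some of the code.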

<!DOCTYPE html> 
 <!--[if lt IE 7]> <html lang="en" class="ie ie6 lte9 lte8 lte7 os-win"> <![endif]-->
 <!--[if IE 7]> <html lang="en" class="ie ie7 lte9 lte8 lte7 os-win"> <![endif]-->
 <!--[if IE 8]> <html lang="en" class="ie ie8 lte9 lte8 os-win"> <![endif]-->
 <!--[if IE 9]> <html lang="en" class="ie ie9 lte9 os-win"> <![endif]-->
 <!--[if gt IE 9]> <html lang="en" class="os-win"> <![endif]-->
 <!--[if !IE]><!--> <html lang="en" class="os-win"> <!--<![endif]-->

<head>

<meta name="lnkd-track-json-lib" content="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=2jds9coeh4w78ed9wblscv68v-eo3jgzogk6v7maxgg86f4u27d&amp;fc=2">
  <meta name="lnkd-track-lib" content="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=eo3jgzogk6v7maxgg86f4u27d&amp;fc=2"><meta name="treeID" content="yGlqHfV7FxMQvJqjACsAAA==">
  <meta name="appName" content="profile">
<meta name="lnkd-track-error" content="/lite/ua/error?csrfToken=ajax%3A1584468784299534813&amp;goback=%2Enpv_131506997_*1_*1_NAME*4SEARCH_9ikF_*1_en*4US_*1_*1_*1_123452511375704499972_1_63_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1"><script src="http://static.licdn.com:80/scds/common/u/lib/fizzy/fz-1.3.3-min.js" type="text/javascript"></script><script type="text/javascript">fs.config({"failureRedirect":"http://www.linkedin.com/nhome/","uniEscape":true,"xhrHeaders":{"X-FS-Origin-Request":"/profile/view?id=131506997&authType=NAME_SEARCH&authToken=9ikF&locale=en_US&srchid=123452511375704499972&srchindex=1&srchtotal=63&trk=vsrp_people_res_name&trkInfo=VSRPsearchId%3A123452511375704499972%2CVSRPtargetId%3A131506997%2CVSRPcmpt%3Aprimary","X-FS-Page-Id":"nprofile-view"}});</script>
<!--{"content":{"search_highlight":{},"message_exchanged":{"messagesOnlyToViewee":true,"messagesOnlyToViewer":true},"Certifications":{"certsMpr":{},"empty":{}},"lix_treasury_callout":"B","network_overview":{"lix_deferLoad":"B","lix_showDetail":"control","distance":3,"lix_deferOnload":"B","allow_pivot_search":false,"i18n_S_NETWORK":"xyz's Network","facets":{"skill_explicit":{"data":[{"count":5,"name":"Equity Research","value":"2112"},{"count":5,"name":"Equities","value":"462"},{"count":5,"name":"Portfolio Management","value":"480"},{"count":4,"name":"Financial Markets","value":"1371"},{"count":4,"name":"Derivatives","value":"814"}]}} }}}}

I tried extracting the JSON part and parsing it with:
>>> json1 = json.loads(f1)

Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    json1 = json.loads(f1)
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

1 answer:

Answer 0 (score: 1)

You can find comments in the HTML by passing lambda text: isinstance(text, Comment) as the text filter, then load the JSON string with the json module. Here is an example:

import json
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("""
<table>
<tr>
   <td><table><tr><td>1</td></tr><tr><td>2</td></tr></table></td>
</tr>
<!--

{"test": [1,2,3]}

-->
<tr>
   <td><table><tr><td>3</td></tr><tr><td>4</td></tr></table></td>
</tr>
</table>
""", "html.parser")

# find() returns the first node matching the filter, here the first comment
comment = soup.find(text=lambda text: isinstance(text, Comment))
data = json.loads(comment)
print(data['test'])

Prints:

[1, 2, 3]
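
One caveat for the page in the question: it has several comments (the IE conditional ones) before the JSON comment, and find() only returns the first match. A minimal sketch of one way around that is to collect every comment and keep only those that parse as JSON. The version below is an assumption on my part rather than anything from the question: it targets Python 3, uses the standard library's html.parser instead of BeautifulSoup, and the names CommentCollector and json_from_comments are made up for illustration.

```python
import json
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: stores the text of every HTML comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->"
        self.comments.append(data)


def json_from_comments(html):
    """Return the parsed value of each comment that contains valid JSON."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    parsed = []
    for text in collector.comments:
        try:
            parsed.append(json.loads(text))
        except ValueError:
            # Not JSON (for example an IE conditional comment), so skip it
            continue
    return parsed


# Shortened stand-in for the page in the question, not the real input
sample = """<!DOCTYPE html>
<!--[if IE 9]> <html lang="en" class="ie ie9"> <![endif]-->
<head>
<!--{"content": {"network_overview": {"distance": 3}}}-->
</head>
"""

results = json_from_comments(sample)
print(results[0]["content"]["network_overview"]["distance"])
```

Note that this keeps any comment whose text happens to be valid JSON, so on a real page you may still want to check that the result is a dict with the keys you expect.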

Hope that helps.