如何使用Python从HTML数据中提取JSON数据?

时间:2019-06-25 08:34:07

标签: python html json regex

我正在尝试制作一个可以在Outlook Email中读取JSON数据的python脚本,但是问题是如何从HTML数据中提取JSON数据。这是我要提取的原始JSON数据。

{
"vpn_detail":
    {
        "username":"harnishs",  
        "tokens":   
            [
                "85188605",
                "00422786",
            ],
        "cluster_name":"*******.com"
    }

}

所以我已经在Outlook中使用imaplib读取了我的JSON数据,但它以HTML格式读取了JSON数据,因此,该JSON数据被转换为HTML,其读取的Outlook电子邮件是这样的(以HTML形式),

b'<html>\r\n<head>\r\n<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8859-=\r\n1">\r\n<style type=3D"text/css" style=3D"display:none;"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D"ltr">\r\n<div id=3D"divtagdefaultwrapper" style=3D"font-size:12pt;color:#000000;font=\r\n-family:Calibri,Helvetica,sans-serif;" dir=3D"ltr">\r\n<p style=3D"margin-top:0;margin-bottom:0"></p>\r\n<div>{<br>\r\n&quot;vpn_detail&quot;:<br>\r\n&nbsp;&nbsp; &nbsp;{<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&quot;username&quot;:&quot;kushpate&q=\r\nuot;,&nbsp;&nbsp; &nbsp;<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&quot;tokens&quot;:&nbsp;&nbsp; &nbsp=\r\n;<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;[<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;85188605&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;00422786&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;94548619&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;51249494&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;HHEF0EA5&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;2E09A81E&quot;<br>\r\n&nbsp; &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;],<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&quot;cluster_name&quot;:&quot;bgl13-=\r\nvpn-cluster-2.cisco.com&quot;<br>\r\n&nbsp;&nbsp; &nbsp;}<br>\r\n<br>\r\n}</div>\r\n<br>\r\n<p></p>\r\n</div>\r\n</body>\r\n</html>\r\n'

因此,现在从此HTML数据中,我需要相同的JSON文件, 我已经将我的代码制作成这样,

from bs4 import BeautifulSoup
MyStr =""" HTML data """
soup = BeautifulSoup(MyStr, "html.parser")
print(soup.text.strip().replace(" ", ""))

所以这段代码给了我这个结果,

b'


<!--P{margin-top:0;margi=
n-bottom:0;}-->




{
"vpn_detail":
   {
      "username":"harnishs&q;=
uot;,   
      "tokens":   =
;
         [
            =
;"85188605",
            =
;"00422786",
            =
;"94548619",
            =
;
          ],
      "cluster_name":"***********.com"
   }

}





'

但是我希望此数据与输入JSON数据相同,但仍不能精确挖掘。建议我进行任何更改,以便可以通过电子邮件获取相同的JSON数据。

1 个答案:

答案 0 :(得分:3)

您可以使用html2text库极大地简化您的任务,该库几乎可以完成所有工作,您只需要删除不必要的标点符号,并用真实的"替换乱码即可:

import re, json, html2text

MyStr = b'<html>\r\n<head>\r\n<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8859-=\r\n1">\r\n<style type=3D"text/css" style=3D"display:none;"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D"ltr">\r\n<div id=3D"divtagdefaultwrapper" dir=3D"ltr" style=3D"font-size: 12pt; colo=\r\nr: rgb(0, 0, 0); font-family: Calibri, Helvetica, sans-serif, &quot;EmojiFo=\r\nnt&quot;, &quot;Apple Color Emoji&quot;, &quot;Segoe UI Emoji&quot;, NotoCo=\r\nlorEmoji, &quot;Segoe UI Symbol&quot;, &quot;Android Emoji&quot;, EmojiSymb=\r\nols;">\r\n<p style=3D"margin-top:0; margin-bottom:0"></p>\r\n<div>\r\n<div>{<br>\r\n&quot;vpn_detail&quot;:<br>\r\n&nbsp;&nbsp; &nbsp;{<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&quot;username&quot;:&quot;kushpate&q=\r\nuot;,&nbsp;&nbsp; &nbsp;<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&quot;tokens&quot;:&nbsp;&nbsp; &nbsp=\r\n;<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;[<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;85188605&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;00422786&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;94548619&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;51249494&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;HHEF0EA5&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;2E09A81E&quot;<br>\r\n&nbsp; &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;],<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&quot;cluster_name&quot;:&quot;bgl13-=\r\nvpn-cluster-2.cisco.com&quot;<br>\r\n&nbsp;&nbsp; &nbsp;}<br>\r\n<br>\r\n}</div>\r\n</div>\r\n<br>\r\n<p></p>\r\n</div>\r\n</body>\r\n</html>\r\n'
MyStrTxt = html2text.html2text(MyStr.decode("utf8"))
clean_string = re.sub(r'(&q;=\s*uot;)|=\s*;\s*', lambda x: '"' if x.group(1) else '', MyStrTxt)
js = json.loads(clean_string)
print(js['vpn_detail']['username']) 
# => 'kushpate'

注意:

  • 您的输入字符串是字节字符串,您需要将其转换为Unicode UTF8字符串,因此,MyStr.decode("utf8")是必需的
  • html2text.html2text(MyStr.decode("utf8"))将清除字符串中的HTML,您将立即获得JSON
  • re.sub(r'(&q;=\s*uot;)|=\s*;\s*', lambda x: '"' if x.group(1) else '', MyStrTxt)删除所有出现的=;,并在其之间留有空格,否则将用真实的&q;=字符替换uot; +零个或多个空格+ "