如何使用lxml中的HTML字符代码处理格式错误的XML

时间:2018-10-24 14:27:24

标签: python lxml

我正在Python 3中使用lxml解析一个大型XML文件,该文件具有HTML字符代码(例如[])。

这是该问题的一个示例,也是我尝试将其标记为重复时尝试使用html.unescape()的一个示例。我仍在努力使这项工作。以下方法有效,但速度很慢,而且似乎很hacky:

from io import StringIO, BytesIO
from lxml import etree
import html
import re

s = b"""<?xml version="1.0" encoding="UTF-8"?><tag>&lsqb;0001&rsqb;</tag>"""

def unescape(s):
    # According to this: http://xml.silmaril.ie/specials.html
    # There are only 4 special characters for XML.  Handle them separately.
    #
    # This site shows this other codes.
    # https://www.dvteclipse.com/documentation/svlinter/How_to_use_special_characters_in_XML.3F.html
    #
    # Use temporary text that isn't likely to be in data.
    tmptxt = {b'&amp;':  ((b'&#x26;', b'&amp;', b'&#38', ), b'zZh7001HdahHq'),
              b'&lt;':   ((b'&#x3c;', b'&#60;', b'&lt', ),  b'zZh7002HdahHq'),
              b'&gt;':   ((b'&#x3e;', b'&#62;', b'&gt',),   b'zZh7002HdahHq'),
              b'&apos;': ((b'&#x27;', b'&#39;', b'&apos',), b'zZh7003HdahHq')}

    # Replace XML special chars with tmptxt
    for k, v in tmptxt.items():
        for search in v[0]:
            s = s.replace(search, v[1])

    # Use html.unescape
    s = html.unescape(s.decode()).encode()

    # replace tmptxt with the allowed XML special chars.
    for k, v in tmptxt.items():
        s = s.replace(v[1], k)

    # Get rid of any other codes and hope for the best
    regex = re.compile(rb'&[^\t\n\f <&#;]{1,32};')
    s = regex.sub(b'', s)

    return s


tree = etree.fromstring(unescape(s))

print(etree.tostring(tree))

似乎可行的第二种方法是tree = etree.fromstring(s, parser=etree.XMLParser(recover=True))。这似乎也很慢,但显然要干净得多。

0 个答案:

没有答案