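I'm parsing a large XML file with lxml in Python 3. The file contains HTML character codes (for example &lsqb; and &rsqb;, which stand for [ and ]) that are not defined in XML.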
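Below is an example of the problem, together with my attempt at using html.unescape(), which was suggested when this question was marked as a duplicate. I'm still struggling to get it working properly. The following approach works, but it is slow and feels very hacky: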
from lxml import etree
import html
import re

s = b"""<?xml version="1.0" encoding="UTF-8"?><tag>&lsqb;0001&rsqb;</tag>"""

def unescape(s):
    # According to this: http://xml.silmaril.ie/specials.html
    # only a few characters are special in XML; the four below are handled separately.
    #
    # This site shows the other codes:
    # https://www.dvteclipse.com/documentation/svlinter/How_to_use_special_characters_in_XML.3F.html
    #
    # Use temporary text that isn't likely to be in the data.
    # Each placeholder is unique so that it maps back to its own entity.
    tmptxt = {b'&amp;': ((b'&amp;', b'&#38;', b'&#x26;'), b'zZh7001HdahHq'),
              b'&lt;': ((b'&lt;', b'&#60;', b'&#x3C;'), b'zZh7002HdahHq'),
              b'&gt;': ((b'&gt;', b'&#62;', b'&#x3E;'), b'zZh7003HdahHq'),
              b'&apos;': ((b'&apos;', b'&#39;', b'&#x27;'), b'zZh7004HdahHq')}
    # Replace the XML special chars with tmptxt.
    for k, v in tmptxt.items():
        for search in v[0]:
            s = s.replace(search, v[1])
    # Use html.unescape on everything that is left.
    s = html.unescape(s.decode()).encode()
    # Get rid of any other codes and hope for the best.
    # This runs before the XML entities are restored, otherwise it would strip them too.
    regex = re.compile(rb'&[^\t\n\f <&#;]{1,32};')
    s = regex.sub(b'', s)
    # Replace tmptxt with the allowed XML special chars.
    for k, v in tmptxt.items():
        s = s.replace(v[1], k)
    return s

tree = etree.fromstring(unescape(s))
print(etree.tostring(tree))
A second approach that seems to work is tree = etree.fromstring(s, parser=etree.XMLParser(recover=True)). It also seems slow, but it is clearly much cleaner.
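One possible way to make the first approach less hacky is to resolve the entities in a single regex pass instead of juggling placeholder strings. The following is only a rough sketch under a few assumptions: the input is UTF-8, unknown entity names may simply be dropped (as the regex above already does), and the names ENTITY, _convert and unescape_non_xml are made up for this illustration. It uses the standard-library xml.etree.ElementTree in place of lxml only so that it runs without extra dependencies; I have not measured whether it is actually faster on a large file.

```python
import html
import re
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml (assumption)

# Matches named, decimal and hexadecimal character references.
ENTITY = re.compile(rb'&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);')


def _convert(match):
    """Turn one character reference into XML-safe UTF-8 bytes."""
    ref = match.group(0).decode('ascii')
    char = html.unescape(ref)
    if char == ref:
        # html.unescape() did not recognise the name: drop it.
        return b''
    # Re-escape the result so &, <, > and quotes remain valid XML;
    # every other character passes through html.escape() unchanged.
    return html.escape(char, quote=True).encode('utf-8')


def unescape_non_xml(data):
    """Resolve HTML entities in UTF-8 encoded XML bytes in one pass."""
    return ENTITY.sub(_convert, data)


sample = (b'<?xml version="1.0" encoding="UTF-8"?>'
          b'<tag a="&quot;x&quot;">&lsqb;0001&rsqb; &amp; &lt;b&gt; &bogus;</tag>')
fixed = unescape_non_xml(sample)
root = ET.fromstring(fixed)
print(fixed)
print(root.text)
```

Because every resolved character goes back through html.escape(), the XML special characters never appear unescaped in the output, so no temporary placeholder text is needed. The sketch does not know about CDATA sections, comments, or entities declared in a DTD, so references inside those would be rewritten as well.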