我已经通过ElementTree解析器将xml文件(Rhythmbox的数据库文件)加载到Python 3中。修改树并使用ascii编码将其写入磁盘(ElementTree.write()
)后,将十六进制代码点中的所有ASCII十六进制字符转换为ASCII十进制代码点。例如,这里是包含版权符号的差异:
< <copyright>© WNYC</copyright>
---
> <copyright>© WNYC</copyright>
有没有办法告诉Python / ElementTree不要这样做?我希望所有十六进制代码保持十六进制代码点。
答案 0 :(得分:1)
我找到了解决方案。首先,我创建了一个新的编解码器错误处理程序,然后猴子修补了ElementTree._get_writer()以使用新的错误处理程序。看起来像:
from xml.etree import ElementTree
import io
import contextlib
import codecs
def lower_first(s):
return s[:1].lower() + s[1:] if s else ''
def html_replace(exc):
if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
s = []
for c in exc.object[exc.start:exc.end]:
s.append('&#%s;' % lower_first(hex(ord(c))[1:].upper()))
return ''.join(s), exc.end
else:
raise TypeError("can't handle %s" % exc.__name__)
codecs.register_error('html_replace', html_replace)
# monkey patch this python function to prevent it from using xmlcharrefreplace
@contextlib.contextmanager
def _get_writer(file_or_filename, encoding):
# returns text write method and release all resources after using
try:
write = file_or_filename.write
except AttributeError:
# file_or_filename is a file name
if encoding == "unicode":
file = open(file_or_filename, "w")
else:
file = open(file_or_filename, "w", encoding=encoding,
errors="html_replace")
with file:
yield file.write
else:
# file_or_filename is a file-like object
# encoding determines if it is a text or binary writer
if encoding == "unicode":
# use a text writer as is
yield write
else:
# wrap a binary writer with TextIOWrapper
with contextlib.ExitStack() as stack:
if isinstance(file_or_filename, io.BufferedIOBase):
file = file_or_filename
elif isinstance(file_or_filename, io.RawIOBase):
file = io.BufferedWriter(file_or_filename)
# Keep the original file open when the BufferedWriter is
# destroyed
stack.callback(file.detach)
else:
# This is to handle passed objects that aren't in the
# IOBase hierarchy, but just have a write method
file = io.BufferedIOBase()
file.writable = lambda: True
file.write = write
try:
# TextIOWrapper uses this methods to determine
# if BOM (for UTF-16, etc) should be added
file.seekable = file_or_filename.seekable
file.tell = file_or_filename.tell
except AttributeError:
pass
file = io.TextIOWrapper(file,
encoding=encoding,
errors='html_replace',
newline="\n")
# Keep the original file open when the TextIOWrapper is
# destroyed
stack.callback(file.detach)
yield file.write
ElementTree._get_writer = _get_writer