How can I preserve hexadecimal character references when writing an ElementTree in Python?

Time: 2017-10-21 17:43:25

Tags: python xml encoding ascii

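I loaded an XML file (Rhythmbox's database file) into Python 3 with the ElementTree parser. After modifying the tree and writing it to disk with ElementTree.write() using the ascii encoding, every hexadecimal character reference in the file had been converted to a decimal character reference. For example, here is the diff for a line containing the copyright symbol: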

<     <copyright>&#xA9; WNYC</copyright>
---
>     <copyright>&#169; WNYC</copyright>

Is there a way to tell Python / ElementTree not to do this? I would like all hexadecimal character references to stay hexadecimal.
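For context, the behavior can be reproduced in a few lines. This is a minimal sketch, assuming Python 3 and the standard `xmlcharrefreplace` codec error handler: the parser decodes `&#xA9;` into the character itself, so the original spelling of the reference is not kept in the tree, and the decimal form is produced when that character is encoded back to ASCII.

```python
from xml.etree import ElementTree

# The parser turns the reference into the actual character (U+00A9).
root = ElementTree.fromstring('<copyright>&#xA9; WNYC</copyright>')
parsed_text = root.text

# The built-in xmlcharrefreplace handler always emits decimal references.
encoded = '\xa9'.encode('ascii', 'xmlcharrefreplace')

# Serializing to ascii therefore yields &#169; rather than &#xA9;.
ascii_bytes = ElementTree.tostring(root, encoding='ascii')
print(ascii_bytes)
```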

1 answer:

Answer 0 (score: 1)

I found a solution. First I created a new codec error handler, then I monkey-patched ElementTree._get_writer() to use the new error handler. It looks like this:

from xml.etree import ElementTree
import io
import contextlib
import codecs


def html_replace(exc):
    # Codec error handler: emit an uppercase hexadecimal character reference
    # (for example &#xA9;) for each character the encoding cannot represent.
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in exc.object[exc.start:exc.end]:
            s.append('&#x%X;' % ord(c))
        return ''.join(s), exc.end
    else:
        # exc is an exception instance, so the class name is on type(exc)
        raise TypeError("can't handle %s" % type(exc).__name__)

codecs.register_error('html_replace', html_replace)


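# Monkey-patch this private ElementTree helper so that it uses html_replace
# instead of xmlcharrefreplace. The body is copied from the standard library,
# so it may need updating to match the Python version in use.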
@contextlib.contextmanager
def _get_writer(file_or_filename, encoding):
    # returns text write method and release all resources after using
    try:
        write = file_or_filename.write
    except AttributeError:
        # file_or_filename is a file name
        if encoding == "unicode":
            file = open(file_or_filename, "w")
        else:
            file = open(file_or_filename, "w", encoding=encoding,
                        errors="html_replace")
        with file:
            yield file.write
    else:
        # file_or_filename is a file-like object
        # encoding determines if it is a text or binary writer
        if encoding == "unicode":
            # use a text writer as is
            yield write
        else:
            # wrap a binary writer with TextIOWrapper
            with contextlib.ExitStack() as stack:
                if isinstance(file_or_filename, io.BufferedIOBase):
                    file = file_or_filename
                elif isinstance(file_or_filename, io.RawIOBase):
                    file = io.BufferedWriter(file_or_filename)
                    # Keep the original file open when the BufferedWriter is
                    # destroyed
                    stack.callback(file.detach)
                else:
                    # This is to handle passed objects that aren't in the
                    # IOBase hierarchy, but just have a write method
                    file = io.BufferedIOBase()
                    file.writable = lambda: True
                    file.write = write
                    try:
                        # TextIOWrapper uses these methods to determine
                        # if BOM (for UTF-16, etc) should be added
                        file.seekable = file_or_filename.seekable
                        file.tell = file_or_filename.tell
                    except AttributeError:
                        pass
                file = io.TextIOWrapper(file,
                                        encoding=encoding,
                                        errors='html_replace',
                                        newline="\n")
                # Keep the original file open when the TextIOWrapper is
                # destroyed
                stack.callback(file.detach)
                yield file.write

ElementTree._get_writer = _get_writer
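
Because `_get_writer` is a private function, the patch above depends on ElementTree internals. As a hedged alternative sketch (not part of the original answer), the same error-handler idea can be applied without patching: serialize the tree to a `str` with `encoding='unicode'`, then encode it to ASCII with a custom handler. The handler name `hex_charref_replace` and the helper `to_ascii_with_hex_refs` are hypothetical names chosen for this illustration.

```python
import codecs
from xml.etree import ElementTree


def hex_charref_replace(exc):
    # Hypothetical handler name: replace each unencodable character
    # with an uppercase hexadecimal reference such as &#xA9;.
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        refs = ''.join('&#x%X;' % ord(c)
                       for c in exc.object[exc.start:exc.end])
        return refs, exc.end
    raise exc


codecs.register_error('hex_charref_replace', hex_charref_replace)


def to_ascii_with_hex_refs(element):
    # Hypothetical helper: serialize to text first, so ElementTree does
    # no byte encoding itself, then encode with the custom handler.
    text = ElementTree.tostring(element, encoding='unicode')
    return text.encode('ascii', errors='hex_charref_replace')


root = ElementTree.fromstring('<copyright>&#xA9; WNYC</copyright>')
data = to_ascii_with_hex_refs(root)
print(data)  # b'<copyright>&#xA9; WNYC</copyright>'
```

Note that `tostring(..., encoding='unicode')` does not emit an XML declaration, so one would have to be written separately if the output file needs it. The resulting bytes can be written to a file opened in binary mode.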