Question

I'm currently writing a script that converts a bunch of XML files from various encodings to a unified UTF-8.

I first try to determine the encoding using LXML:

def get_source_encoding(self):
    tree = etree.parse(self.inputfile)
    encoding = tree.docinfo.encoding
    self.inputfile.seek(0)
    return (encoding or '').lower()

If that comes back empty, I try to get it from chardet:

def guess_source_encoding(self):
    chunk = self.inputfile.read(1024 * 10)
    self.inputfile.seek(0)
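    # chardet.detect returns a dict; the guessed name is under 'encoding' and may be None
    return (chardet.detect(chunk)['encoding'] or '').lower()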

Then I use codecs to convert the file's encoding:

def convert_encoding(self, source_encoding, input_filename, output_filename):
    chunk_size = 16 * 1024

    with codecs.open(input_filename, "rb", source_encoding) as source:
        with codecs.open(output_filename, "wb", "utf-8") as destination:
            while True:
                chunk = source.read(chunk_size)

                if not chunk:
                    break

                destination.write(chunk)

Finally, I'm trying to rewrite the XML header. If the XML header was originally

<?xml version="1.0"?>

or

<?xml version="1.0" encoding="windows-1255"?>

I'd like to turn it into

<?xml version="1.0" encoding="UTF-8"?>

My current code doesn't seem to work:

def edit_header(self, input_filename):
    output_filename = tempfile.mktemp(suffix=".xml")

    with open(input_filename, "rb") as source:
        parser = etree.XMLParser(encoding="UTF-8")
        tree = etree.parse(source, parser)

        with open(output_filename, "wb") as destination:
            tree.write(destination, encoding="UTF-8")

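The file I'm currently testing with has a header that doesn't specify an encoding. How can I make it output the header correctly, with the encoding specified?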

Answer 1

Try:

tree.write(destination, xml_declaration=True, encoding='UTF-8')

From the API docs:

xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None).

An example from ipython:

In [15]:  etree.ElementTree(etree.XML('<hi/>')).write(sys.stdout, xml_declaration=True, encoding='UTF-8')
<?xml version='1.0' encoding='UTF-8'?>
<hi/>
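
The three settings can also be compared side by side with etree.tostring, which returns bytes and so avoids any dependence on how the output stream handles them. This is only a minimal sketch, assuming lxml is installed; the helper name serialize is made up for illustration.

```python
from lxml import etree

root = etree.XML('<hi/>')


def serialize(xml_declaration, encoding='UTF-8'):
    # Hypothetical helper: serialize the sample element to bytes.
    return etree.tostring(root, xml_declaration=xml_declaration,
                          encoding=encoding)


always = serialize(True)         # declaration is always written
never = serialize(False)         # declaration is never written
default_utf8 = serialize(None)   # omitted, because the encoding is UTF-8
default_latin1 = serialize(None, encoding='ISO-8859-1')  # written, non-UTF-8

print(always.decode('utf-8'))
```

With the default of None, the declaration only appears for encodings other than US-ASCII or UTF-8, which would explain why the header was missing when writing UTF-8 without xml_declaration=True.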

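On reflection, I think you're trying too hard. lxml automatically detects the encoding and parses the file correctly based on that encoding.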

So all you really have to do (at least in Python 2.7) is:

def convert_encoding(self, source_encoding, input_filename, output_filename):
    tree = etree.parse(input_filename)
    # Binary mode: lxml writes encoded bytes, which Python 3 text-mode files reject
    with open(output_filename, 'wb') as destination:
        tree.write(destination, encoding='utf-8', xml_declaration=True)
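
As a rough end-to-end check of this approach, the sketch below writes a small windows-1255 document to a temporary directory, re-serializes it with lxml, and reads the result back. The file names and the sample content are invented for illustration, and it assumes lxml is installed; the output file is opened in binary mode so that the same code also runs on Python 3.

```python
import os
import tempfile

from lxml import etree

workdir = tempfile.mkdtemp()
input_filename = os.path.join(workdir, 'input.xml')    # hypothetical paths
output_filename = os.path.join(workdir, 'output.xml')

# Sample document, declared and encoded as windows-1255 (Hebrew text).
text = u'\u05e9\u05dc\u05d5\u05dd'
source_bytes = (u'<?xml version="1.0" encoding="windows-1255"?>\n'
                u'<greeting>%s</greeting>' % text).encode('windows-1255')
with open(input_filename, 'wb') as f:
    f.write(source_bytes)

# lxml picks up the source encoding from the declaration in the file.
tree = etree.parse(input_filename)
detected = tree.docinfo.encoding

# Re-serialize as UTF-8 and force the XML declaration to be written.
with open(output_filename, 'wb') as destination:
    tree.write(destination, encoding='UTF-8', xml_declaration=True)

with open(output_filename, 'rb') as f:
    result = f.read()

print(detected)
print(result.decode('utf-8'))
```

No separate detection, transcoding, or header-editing step is involved here; parsing and writing with an explicit encoding covers all three.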
