Question

I need to process XML files containing text with Romanian characters (for example, text such as "licenţe şi mărci").

After processing, this text ends up in an HTML document. If, in the final step, I write the content with something like:

doc.write(path, encoding='utf-8', method='xml' )

Note: doc above is an instance of lxml.etree.ElementTree.

I get something like

<span id="123">licenţe şi mărci</span>

and everything looks fine (that is, the Romanian characters are not converted to HTML entities).

If I use the html output method instead

doc.write(path, encoding='utf-8', method='html' )

I get something like

<span id="128927">licen&#355;e &#351;i m&#259;rci</span>

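Although the document looks fine in a browser, its text is also consumed by another process, which extracts the text and uses it for some diffing work, and that process gets thoroughly confused by the HTML entities.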
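For reference, a consumer-side workaround could look like the sketch below. It does not change what lxml writes; it only normalizes the markup before the diffing step. The function name decode_non_ascii_refs is hypothetical, and the sketch assumes the only problem is numeric character references (decimal or hexadecimal) to non-ASCII characters. References to ASCII characters such as &#60; are deliberately left alone so the markup stays well-formed.

```python
import re

# Matches decimal (&#355;) and hexadecimal (&#x163;) numeric character references.
_NUMERIC_REF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def decode_non_ascii_refs(markup):
    """Replace numeric references to non-ASCII characters with the characters.

    Hypothetical helper: references to ASCII code points, surrogates, and
    out-of-range values are returned unchanged.
    """
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point < 128 or code_point > 0x10FFFF:
            return match.group(0)  # keep ASCII escapes and invalid values
        if 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)  # lone surrogates are not valid characters
        return chr(code_point)

    return _NUMERIC_REF.sub(_replace, markup)


if __name__ == '__main__':
    sample = '<span id="123">licen&#x163;e &#x15F;i m&#x103;rci</span>'
    print(decode_non_ascii_refs(sample))
```

Named entities such as &amp; are not touched by this pattern, which is intentional: decoding them inside markup would change its meaning.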

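Using xml as the method argument of etree.ElementTree.write seems to solve my immediate problem, but it still feels wrong: I really do want to produce HTML files, and the xml method may not behave well in other cases.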

Is there a way to write an HTML document (using method='html') without lxml replacing non-ASCII characters with HTML entities?

Following @mzjn's question, it turns out the story is more complicated than I first described.

Here is minimal code that reproduces the problem:

from lxml import etree
from os import path

def process( method, out_name ):

    # the doc
    x = etree.XML('<span id="123">licenţe şi mărci</span>')
    tree = etree.ElementTree(x)

    #do some processing
    modifiedTree = modifyDoc(tree)

    # write the result
    modifiedTree.write(out_name, encoding='utf-8',  method=method)

def modifyDoc(tree):
    root = tree.getroot()

    doNothingTransformPath = _absolutePath('transform.xslt')
    transform_doc = etree.parse(doNothingTransformPath)
    transform = etree.XSLT(transform_doc)
    return transform(root)

def _absolutePath(relative):
    return path.abspath(path.join( __file__ , '..', relative))

if __name__ == '__main__':
    process( 'xml', 'out1.xml')
    process( 'html', 'out2.html')

The XSLT transform, which does nothing in this example, is:

-transform.xslt-

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="no"/>
    <xsl:template match="/">
        <xsl:copy-of select="/"/>
    </xsl:template>
</xsl:stylesheet>

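If I do not pass the document through the XSLT transform, the output contains no HTML entities. When it is passed through the modifyDoc function (that is, when the XSLT transform is applied), the output changes as described above: with method='html', it contains HTML entities.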

So running the program above produces two files:

-out1.xml-

<span id="123">licenţe şi mărci</span>

and -out2.html-

<span id="123">licen&#x163;e &#x15F;i m&#x103;rci</span>
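As a quick sanity check, the two outputs appear to differ only in how the characters are spelled, not in the text they carry. The sketch below uses only the standard library's html.unescape on the strings quoted in this question (the variable names are just for illustration); it says nothing about why lxml chooses one spelling over the other.

```python
import html

# The two serializations quoted above, copied as plain strings.
out1 = '<span id="123">licenţe şi mărci</span>'
out2 = '<span id="123">licen&#x163;e &#x15F;i m&#x103;rci</span>'

# The decimal form seen earlier with method='html' (text only).
decimal_text = 'licen&#355;e &#351;i m&#259;rci'

# html.unescape decodes both hexadecimal and decimal numeric references.
same_markup = html.unescape(out2) == out1
same_text = html.unescape(decimal_text) == 'licenţe şi mărci'

if __name__ == '__main__':
    print(same_markup, same_text)
```

Note that html.unescape also decodes &lt; and &amp;, so it is only safe here because the sample contains no such escapes.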

How can I stop lxml from generating HTML entities when writing a UTF-8 document?

0 answers: