Question

I noticed that the XML entity &quot; is automatically converted to the literal character it stands for:

>>> from lxml import etree as et
>>> parser = et.XMLParser()
>>> xml = et.fromstring("<root><elem>&quot;hello world&quot;</elem></root>", parser)
>>> print et.tostring(xml, pretty_print=1)
<root>
  <elem>"hello world"</elem>
</root>

>>>
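A minimal sketch of why the serializer has nothing to restore: once parsed, &quot; and a literal " give the same text, so the tree keeps no record of which form the source used. The element name e is made up for illustration, and the sketch is written for Python 3.

```python
from lxml import etree

# Hypothetical one-element documents that differ only in how the quotes are written.
with_entity = etree.fromstring('<e>&quot;hello world&quot;</e>')
with_literal = etree.fromstring('<e>"hello world"</e>')

# Both parse to the same text; the entity reference is gone after parsing.
same_text = with_entity.text == with_literal.text
same_output = etree.tostring(with_entity) == etree.tostring(with_literal)
print(same_text, same_output)
```

Any fix therefore has to happen at serialization time, not at parse time.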

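I found a related old thread (2009-02-07):

s = cStringIO.StringIO("""&quot;She's the MAN!&quot;""")
e = etree.parse(s, etree.XMLParser(resolve_entities=False))

Note that there is also etree.fromstring().

etree.tostring(e)
'"She\'s the MAN!"'

I was expecting resolve_entities=False to prevent the translation of, for example, &quot; to ".

The "resolve_entities" option is meant for entities defined in a DTD, for which you want to keep the reference instead of the resolved value. The entities you mention are part of the XML specification, not of a DTD.

Is there another way to prevent this behavior (or, if nothing else, to reverse it after the fact)?

Well, what you get is well-formed XML. May I ask why you need the entity references in the output?

But the only response was that question about why you would want this; there was no direct answer. I am surprised that the etree parser forces the conversion without providing an option to disable it.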
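The distinction drawn in the quoted reply can be checked with a small sketch. The document and the entity name greet are invented for illustration; the expectation, based on the reply above, is that resolve_entities=False preserves only the DTD-defined reference.

```python
from lxml import etree

# Hypothetical document: "greet" is declared in an internal DTD subset,
# while &quot; is one of the entities predefined by the XML specification.
doc = b'<!DOCTYPE root [<!ENTITY greet "hello">]><root>&greet; &quot;world&quot;</root>'

keep_refs = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(doc, keep_refs)
out = etree.tostring(root)

# The DTD entity reference is expected to survive; &quot; is still turned into ".
print(out)
```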

The following example shows why I need a solution. This XML is for the XBMC skinning parser:

>>> print open("/tmp/so.xml").read() #the original file
<window id="1234">
        <defaultcontrol>101</defaultcontrol>
        <controls>
                <control type="button" id="101">
                        <onfocus>Dialog.Close(212)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="102">
                        <visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
                        <onfocus>RunScript(script.test)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="103">
                        <visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
                        <onfocus>Close</onfocus>
                        <onfocus>RunScript(&quot;/.xbmc/addons/script.hello.world/default.py&quot;,&quot;$INFO[VideoPlayer.Album]&quot;,&quot;$INFO[VideoPlayer.Genre]&quot;)</onfocus>
                </control>
        </controls>
</window>

>>> root = et.parse("/tmp/so.xml", parser)
>>> r = root.getroot()
>>> for c in r:
...     for cc in c:
...         if cc.attrib.get('id') == "103":
...             cc.remove(cc[1])  # remove one element, just for demonstration
... 
>>> o = open("/tmp/so.xml", "w")
>>> o.write(et.tostring(r, pretty_print=1)) #save it back
>>> o.close()
>>> print open("/tmp/so.xml").read() #the file after the change
<window id="1234">
        <defaultcontrol>101</defaultcontrol>
        <controls>
                <control type="button" id="101">
                        <onfocus>Dialog.Close(212)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="102">
                        <visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
                        <onfocus>RunScript(script.test)</onfocus>
                        <onfocus>SetFocus(11)</onfocus>
                </control>
                <control type="button" id="103">
                        <visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
                        <onfocus>RunScript("/.xbmc/addons/script.hello.world/default.py","$INFO[VideoPlayer.Album]","$INFO[VideoPlayer.Genre]")</onfocus>
                </control>
        </controls>
</window>

>>>

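As you can see in the last onfocus element of the control with id "103", the &quot; entities are no longer in their original form. This causes a bug: if the $INFO[VideoPlayer.Album] variable contains nested quotes, the argument becomes ""test"", which is invalid and raises an error.

So is there any hacky way to keep &quot; in its original form?

[UPDATE]: For those who are interested, the other 3 predefined XML entities, i.e. &gt;, &lt; and &amp;, are only converted with method="html" and inside a script tag. lxml and xml.etree.ElementTree, on both Python 2 and Python 3, share the same mechanism, which is confusing: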

>>> from lxml import etree as et
>>> r = et.fromstring("<root><script>&quot;&apos;&amp;&gt;&lt;</script><p>&quot;&apos;&amp;&gt;&lt;</p></root>")
>>> print et.tostring(r, pretty_print=1, method="xml")
<root>
  <script>"'&amp;&gt;&lt;</script>
  <p>"'&amp;&gt;&lt;</p>
</root>

>>> print et.tostring(r, pretty_print=1, method="html")
<root><script>"'&><</script><p>"'&amp;&gt;&lt;</p></root>
>>>

[UPDATE2]: The following is a list of all possible HTML tags:

#https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py
acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area',
                       'article', 'aside', 'audio', 'b', 'big', 'blockquote', 'br', 'button',
                       'canvas', 'caption', 'center', 'cite', 'code', 'col', 'colgroup',
                       'command', 'datagrid', 'datalist', 'dd', 'del', 'details', 'dfn',
                       'dialog', 'dir', 'div', 'dl', 'dt', 'em', 'event-source', 'fieldset',
                       'figcaption', 'figure', 'footer', 'font', 'form', 'header', 'h1',
                       'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 'input', 'ins',
                       'keygen', 'kbd', 'label', 'legend', 'li', 'm', 'map', 'menu', 'meter',
                       'multicol', 'nav', 'nextid', 'ol', 'output', 'optgroup', 'option',
                       'p', 'pre', 'progress', 'q', 's', 'samp', 'section', 'select',
                       'small', 'sound', 'source', 'spacer', 'span', 'strike', 'strong',
                       'sub', 'sup', 'table', 'tbody', 'td', 'textarea', 'time', 'tfoot',
                       'th', 'thead', 'tr', 'tt', 'u', 'ul', 'var', 'video']

from lxml import etree as et
for e in acceptable_elements:
    # Wrap the same text in each tag, e.g. <p>hello&amp;world</p>
    r = et.fromstring(e.join(["<", ">hello&amp;world</", ">"]))
    s = et.tostring(r, pretty_print=1, method="html")
    closed_tag = "</" + e + ">"
    # Print only the elements whose closing tag is missing from the output
    if closed_tag not in s:
        print s

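Run this code and you will see output like the following:

<area>
<br>
<col>
<hr>
<img>
<input>

As you can see, only the opening tag is printed; the rest just disappears into a black hole. I tested all 5 XML entities and they all behave the same way. This is very confusing. It does not happen when using HTMLParser, so I guess there is a bug between the fromstring step (whose method should default to xml) and the tostring(method="html") step. I found that it has nothing to do with entities, because "<img>hello</img>" (no entity) is also truncated to <img> (the hello simply vanishes, although it shows up whenever the element is printed with method="xml").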
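One likely explanation, offered here as an assumption and not something confirmed in the thread: area, br, col, hr, img and input are void elements in HTML, which cannot have content or an end tag, so the HTML serializer writes only the start tag. A minimal Python 3 sketch that reproduces the difference between the two output methods:

```python
from lxml import etree

# Parse with the XML parser, where <img> is an ordinary element that may hold text.
img = etree.fromstring('<img>hello</img>')

as_xml = etree.tostring(img, method="xml")
as_html = etree.tostring(img, method="html")

# The XML serializer keeps the text; the HTML serializer drops it for img.
print(as_xml)
print(as_html)
```

The text is still in the tree (img.text is unchanged); it is only omitted from the HTML output.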

Answer 1

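from xml.sax.saxutils import escape
from lxml import etree

# Extra characters to escape on top of the defaults (&, < and >).
QUOTES = {"'": "&apos;", "\"": "&quot;"}

def to_string(xdoc):
    """Serialize xdoc by hand so that quotes are written as &apos; and &quot;."""
    r = ""
    for action, elem in etree.iterwalk(xdoc, events=("start", "end")):
        if action == 'start':
            text = escape(elem.text, QUOTES) if elem.text is not None else ""
            # Escape attribute values too, so a quote inside one cannot break the output.
            attrs = "".join([' %s="%s"' % (k, escape(v, QUOTES)) for k, v in elem.attrib.items()])
            r += "<%s%s>%s" % (elem.tag, attrs, text)
        elif action == 'end':
            # Tail text follows the closing tag; use a newline when there is none.
            r += "</%s>%s" % (elem.tag, escape(elem.tail, QUOTES) if elem.tail else "\n")
    return r

# Example input; replace with the contents of your own file.
xml_text = '<root><elem>&quot;hello world&quot;</elem></root>'
xdoc = etree.fromstring(xml_text)
s = to_string(xdoc)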
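The answer relies on the optional second argument of xml.sax.saxutils.escape, which maps extra characters to replacement strings in addition to the default escaping of &, < and >. A short sketch of that call in isolation; the sample string is made up for illustration:

```python
from xml.sax.saxutils import escape

# Same extra mapping as in the answer: quotes are escaped in addition to &, < and >.
quotes = {"'": "&apos;", "\"": "&quot;"}

sample = 'RunScript("a.py","$INFO[x]") & more'
escaped = escape(sample, quotes)
print(escaped)
```

Because escape handles & before applying the extra mapping, the generated &quot; is not escaped a second time.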

lxml - Is there any hacky way to keep &#34;?
