Question

I want to parse some XML documents that I receive as strings.

import lxml.etree
import re
from lxml.html.soupparser import fromstring,parse

try:
    from bs4 import UnicodeDammit             # BeautifulSoup 4

    def decode_html(html_string):
        converted = UnicodeDammit(html_string)
        if not converted.unicode_markup:
            # UnicodeDecodeError needs five arguments, so raise ValueError instead.
            raise ValueError(
                "Failed to detect encoding, tried [%s]"
                % ', '.join(str(enc) for enc in converted.tried_encodings))
        # print(converted.original_encoding)
        return converted.unicode_markup

except ImportError:
    from BeautifulSoup import UnicodeDammit   # BeautifulSoup 3

    def decode_html(html_string):
        converted = UnicodeDammit(html_string, isHTML=True)
        if not converted.unicode:
            # UnicodeDecodeError needs five arguments, so raise ValueError instead.
            raise ValueError(
                "Failed to detect encoding, tried [%s]"
                % ', '.join(str(enc) for enc in converted.triedEncodings))
        # print(converted.originalEncoding)
        return converted.unicode


def tryMe(inString):
    root = fromstring(decode_html(inString))

    # print(lxml.etree.tostring(root, pretty_print=True).strip())

    # Fallback: every <p3> in the document.
    backups = root.xpath(".//p3")
    nodes = root.xpath("./doc/p1/p2/p3[contains(text(),'ABC')]//preceding::p1//p3")

    if not nodes:
        print("No ABC")
        nodes = root.xpath("./doc/p1/p2/p3[contains(text(),'XYZ')]//preceding::p1//p3")

        if not nodes:
            print("No XYZ")
            nodes = backups

    # Collapse each run of whitespace to a single space; tolerate empty <p3> nodes.
    return " ".join(re.sub(r'\s+', ' ', (para.text or '').strip()) for para in nodes)

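Basically, I want to find a <p3> tag that contains the text ABC. If such a node is found, I ignore everything after it; hence the XPath. Otherwise, I look for a <p3> tag containing the text XYZ. If that is found, I ignore everything after it. Otherwise, I just process all the <p3> nodes and return.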
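To make the intended selection concrete, here is a minimal sketch of the same idea in plain Python. It uses the standard library's xml.etree.ElementTree rather than lxml, and loops instead of the XPath above, so it is an approximation and not the code from the question. The helper names (find_marker_block, select_p3_texts) and the sample document are hypothetical, and the sketch assumes the <p1> blocks are siblings that are not nested inside each other.

```python
import re
import xml.etree.ElementTree as ET


def find_marker_block(p1_blocks, marker):
    """Return the index of the first <p1> holding a <p3> whose text contains marker."""
    for index, block in enumerate(p1_blocks):
        if any(marker in (p3.text or "") for p3 in block.iter("p3")):
            return index
    return None


def select_p3_texts(root, markers=("ABC", "XYZ")):
    """Join the text of the <p3> elements that precede the first marker block."""
    p1_blocks = list(root.iter("p1"))
    kept = list(root.iter("p3"))  # fallback: no marker found, keep every <p3>
    for marker in markers:
        index = find_marker_block(p1_blocks, marker)
        if index is not None:
            # Keep only the <p3> elements from the <p1> blocks before the marker.
            kept = [p3 for block in p1_blocks[:index] for p3 in block.iter("p3")]
            break
    # Collapse runs of whitespace and tolerate empty <p3> elements.
    return " ".join(re.sub(r"\s+", " ", (p3.text or "").strip()) for p3 in kept)


# Hypothetical sample document, used only to exercise the sketch.
SAMPLE = """<doc>
  <p1><p2><p3>first   paragraph</p3></p2></p1>
  <p1><p2><p3>second</p3></p2></p1>
  <p1><p2><p3>ABC marker</p3></p2></p1>
  <p1><p2><p3>ignored</p3></p2></p1>
</doc>"""

result = select_p3_texts(ET.fromstring(SAMPLE))
print(result)
```

Unlike the XPath version, this sketch stops at the first marker block and returns an empty string when the marker sits in the very first <p1>, instead of falling through to the next marker.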

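This works for UTF-8 documents but fails for UTF-16. For any UTF-16 document I always get an empty string, even though I can see that the XML contains <p3> nodes with text such as ABC and XYZ. I noticed that instead of the expected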

<p3>ABC</p3>

the text of the UTF-16 document shows up as

&lt;p3&gt;ABC&lt;/p3&gt;

So lxml.etree fails to parse it as proper XML.

How should I solve this?
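One direction that may be worth trying is to hand the raw bytes to an XML parser and let it detect the encoding from the byte order mark or the XML declaration, instead of decoding first and going through the HTML-oriented soupparser. The sketch below only illustrates that idea with the standard library's xml.etree.ElementTree, not lxml. The helper name parse_xml_bytes and the sample document are hypothetical, and whether this resolves the failure described above has not been tested against the real documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a small document declared and encoded as UTF-16.
XML_TEXT = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    "<doc><p1><p2><p3>ABC</p3></p2></p1></doc>"
)
# Python's "utf-16" codec prepends a byte order mark to the encoded bytes.
utf16_bytes = XML_TEXT.encode("utf-16")


def parse_xml_bytes(data):
    """Parse raw XML bytes and let the parser work out the encoding itself."""
    return ET.fromstring(data)


root = parse_xml_bytes(utf16_bytes)
p3_texts = [p3.text for p3 in root.iter("p3")]
print(p3_texts)
```

The point of the design is that the bytes are never decoded by hand, so the encoding named in the XML declaration and the actual byte encoding cannot disagree by the time the parser sees them.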

Python: How to parse XML with UTF-16

0 answers: