Question

It seems I was using the wrong function. With .fromstring there is no error message.

xml_ = load() # here comes the unicode string with Cyrillic letters 

print xml_    # prints everything fine 

print type(xml_) # 'lxml.etree._ElementUnicodeResult' = unicode 

xml = xml_.decode('utf-8') # here is an error

doc = lxml.etree.parse(xml) # if I do not decode it - the same error appears here

 File "testLog.py", line 48, in <module>
    xml = xml_.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 89-96: ordinal not in range(128)

If

xml = xml_.encode('utf-8')

doc = lxml.etree.parse(xml) # here's an error

or

xml = xml_

then

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 89: ordinal not in range(128)

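If I understand correctly: I have to decode a non-ASCII string into the internal representation, work with that representation, and encode it again before sending it to output? That seems to be exactly what I am doing.

The input data must be in UTF-8 because of the 'Accept-Charset': 'utf-8' header.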

Answer 1

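In Python 2, str and unicode objects are different types and hold their content in different in-memory representations. unicode is the decoded form of text, while str is the encoded form.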

# -*- coding: utf-8 -*-

# Now, my string literals in this source file will
#    be str objects encoded in utf-8.

# In Python3, they will be unicode objects.
#    Below examples show the Python2 way.

s = 'ş'
print type(s) # prints <type 'str'>

u = s.decode('utf-8')
# Here, we create a unicode object from a string
#    which was encoded in utf-8.

print type(u) # prints <type 'unicode'>

As you can see,

.encode() --> str
.decode() --> unicode

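When we encode or decode a string, we need to make sure the text can actually be represented in the source/target encoding. A string encoded in iso-8859-1 cannot be decoded correctly with iso-8859-9.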
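To make the mismatch concrete, here is a small sketch. It reuses the 'ş' character from the example above and is written so that it also runs on Python 3, where bytes plays the role of Python 2's str. The variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch: the same byte stands for different characters in different codecs.

text = u'\u015f'  # LATIN SMALL LETTER S WITH CEDILLA, i.e. 'ş'

# iso-8859-9 (Turkish) can represent this character as a single byte.
raw = text.encode('iso-8859-9')

# Decoding those bytes with the wrong codec raises no error here;
# it silently yields a different character (U+00FE, 'þ').
wrong = raw.decode('iso-8859-1')

# Encoding into a codec that lacks the character fails loudly instead.
try:
    text.encode('iso-8859-1')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
```

Note that the wrong decode goes unnoticed while the wrong encode raises, so a mismatch is not always reported as an exception.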

As for the second error reported in the question: lxml.etree.parse() works on file-like objects (or a file name). To parse from a string, use lxml.etree.fromstring().
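The difference between the two entry points can be sketched as follows. To keep the example free of third-party packages it uses the standard library's xml.etree.ElementTree as a stand-in; lxml.etree offers fromstring() and parse() as well, but the exact behaviour of lxml on your input is not verified here. The sample document is made up.

```python
import io
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# A tiny, made-up document with Cyrillic text, held as a unicode string.
xml_text = u'<log><msg>\u041f\u0440\u0438\u0432\u0435\u0442</msg></log>'
xml_bytes = xml_text.encode('utf-8')

# fromstring() parses markup that is already in memory
# and returns the root element directly.
root_from_string = ET.fromstring(xml_bytes)

# parse() wants a file name or a file-like object, so the bytes
# have to be wrapped before they can be handed over.
tree = ET.parse(io.BytesIO(xml_bytes))
root_from_file = tree.getroot()
```

Passing the markup itself to parse() makes the parser treat it as a file name, which is one plausible way to end up with the second error in the question.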

Answer 2

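If your original string is unicode, then it only makes sense to encode it to utf-8, not to decode it from utf-8.

I think the XML parser can only handle XML that is ASCII.

So use xml = xml_.encode('ascii','xmlcharrefreplace') to convert the unicode characters that are not in ASCII into XML entities (numeric character references).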
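A minimal sketch of what 'xmlcharrefreplace' produces, again using the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The one-element document is invented for the example.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Made-up document; the Cyrillic letter is U+0416, decimal 1046.
xml_text = u'<a>\u0416</a>'

# Every character outside ASCII becomes a numeric character reference.
ascii_xml = xml_text.encode('ascii', 'xmlcharrefreplace')

# The parser turns the reference back into the original character.
root = ET.fromstring(ascii_xml)
```

This only helps for text and attribute values: character references are not allowed inside element or attribute names, so a document with non-ASCII tag names would no longer be well-formed after this conversion.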

Answer 3

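The lxml library has already given you a unicode object. You are running into Python 2's automatic unicode/bytes conversion. The hint is that you asked for a decode but got an encode error: Python first tries to encode your unicode string to bytes with the default codec (ascii), and only then decode those bytes from UTF-8 back to unicode.

Use the .encode method on a unicode object to convert it to bytes (the str type).

Watching this will teach you how to get this right: http://nedbatchelder.com/text/unipain.html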
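The hidden step can be spelled out explicitly. The sketch below imitates what Python 2 does when .decode('utf-8') is called on a unicode object; it is written with explicit calls so it behaves the same on Python 3. The sample text is arbitrary.

```python
# Arbitrary Cyrillic sample that is already decoded text (unicode).
text = u'\u0434\u0430\u043d\u043d\u044b\u0435'

# Roughly what Python 2 does for unicode.decode('utf-8'):
# an implicit ASCII encode first, then the requested UTF-8 decode.
try:
    text.encode('ascii').decode('utf-8')
    implicit_step_failed = False
except UnicodeEncodeError:
    # This is the "can't encode" error seen in the question's traceback.
    implicit_step_failed = True

# The conversion that is actually wanted: unicode -> bytes, stated explicitly.
data = text.encode('utf-8')
round_trip = data.decode('utf-8')
```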

Answer 4

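I assume you are trying to parse some website?

Did you verify that the website is correct? Maybe its encoding is wrong?

Many websites are broken and rely on web browsers having very robust parsers. You could give beautifulsoup a try; it is also very robust.

There is a de facto web standard that the "Charset" HTTP header (which may involve negotiation and relates to the Accept-Charset you mentioned) is overruled by any <meta http-equiv=... tag in the HTML file!

So you may simply not be getting UTF-8 as input!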
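One quick way to check this is to decode the raw bytes strictly as UTF-8 before parsing. The helper name is_valid_utf8 is hypothetical, and the sample bytes are built locally rather than fetched from a real site.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        raw.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# The same Cyrillic word, once as UTF-8 and once as cp1251 bytes.
sample = u'\u041f\u0440\u0438\u0432\u0435\u0442'
utf8_ok = is_valid_utf8(sample.encode('utf-8'))
cp1251_ok = is_valid_utf8(sample.encode('cp1251'))
```

A failed check shows the input is not UTF-8; a passed check is only a strong hint, since some byte sequences are valid in several encodings.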

Answer 5

For me, using the .fromstring() method was what was needed.

Python: I use .decode() - 'ascii' codec can't encode
