Question

Consider this function:

def escape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        try:
            c = c.decode('ascii')
        except UnicodeDecodeError:
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)

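It is supposed to escape all non-ASCII characters with the corresponding htmlentitydefs entity. Unfortunately, Python throws

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

when the variable text contains the string whose repr() is u'Tam\xe1s Horv\xe1th'.

But I don't use str.encode(). I only use str.decode(). Am I missing something?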

Answer 1

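This is a misleading error report that comes from the way Python handles the de/encoding process. You tried to decode an already decoded string a second time, which confuses the Python function, and it retaliates by confusing you in turn! ;-) As far as I know, the de/encoding process is carried out by the codecs module, and somewhere in there lies the origin of this misleading exception message.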

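You can check for yourself. Either

u'\x80'.encode('ascii')

or

u'\x80'.decode('ascii')

will throw a UnicodeEncodeError, whereas

u'\x80'.encode('utf8')

does not, but

u'\x80'.decode('utf8')

does again!

I guess you are confused about the meaning of encoding and decoding. To put it simply: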

                     decode             encode    
ByteString (ascii)  --------> UNICODE  --------->  ByteString (utf8)
            codec                                              codec

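But why does the decode method take a codec argument? Well, the underlying function cannot guess which codec the ByteString was encoded with, so it takes the codec as a hint. If none is provided, it assumes you mean sys.getdefaultencoding() to be used implicitly.

So when you use c.decode('ascii'), a) you have an (encoded) ByteString (that is why you use decode), b) you want to get a unicode object (that is what you use decode for), and c) the codec the ByteString is encoded with is ascii.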
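To illustrate why decode needs the codec as a hint, here is a minimal sketch. It assumes Python 3, where byte strings are bytes and text is str (in the Python 2 of the question these are str and unicode). The variable names are made up for illustration.

```python
# One and the same byte; what it means depends entirely on the codec.
raw = b"\xe1"

# Under Latin-1 this byte is the character U+00E1 (a with acute accent).
as_latin1 = raw.decode("latin-1")

# Under UTF-8 a lone 0xE1 byte is an incomplete sequence, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Under ASCII any byte above 0x7F is out of range, so decoding fails too.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Note that a failed decode reports a UnicodeDecodeError, which is the error the asker expected to see.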

See also:
    https://stackoverflow.com/a/370199/1107807
    http://docs.python.org/howto/unicode.html
    http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
    http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror

Answer 2

The string you are passing is already unicode. So, before Python can call decode on it, it has to actually encode it, and by default it does so using the ASCII encoding.
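The implicit step can be spelled out roughly like this. This is only a sketch in Python 3 (where str no longer has a decode method), and py2_style_decode is a hypothetical helper written for illustration, not anything from the standard library. It assumes the Python 2 default encoding is ASCII.

```python
def py2_style_decode(text, codec):
    """Hypothetical helper: mimic what Python 2 does for unicode.decode(codec)."""
    # Implicit step: the text is first turned into bytes with the default
    # encoding (assumed to be ASCII here). This is where non-ASCII text fails.
    as_bytes = text.encode("ascii")
    # Only afterwards does the decode that was actually requested run.
    return as_bytes.decode(codec)


plain = py2_style_decode("Tam", "ascii")  # pure ASCII passes through

try:
    py2_style_decode("\xe1", "ascii")
    error_name = None
except UnicodeError as exc:
    # The failure comes from the hidden encode step, not from decode.
    error_name = type(exc).__name__
```

Here error_name ends up as the encode error from the question, even though the caller only asked for a decode.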

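Edit to add: it depends on what you want to do. If you simply want to convert a unicode string with non-ASCII characters into an HTML-encoded representation, you can do it in one call: text.encode('ascii', 'xmlcharrefreplace').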
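For the string from the question, that one call would look something like the following. The sketch is written for Python 3, where the result is a bytes object. Note that this error handler produces numeric character references rather than the named entities from htmlentitydefs.

```python
text = "Tam\xe1s Horv\xe1th"

# Every character that ASCII cannot represent is replaced by &#NNN;
# where NNN is the decimal code point (225 for U+00E1).
escaped = text.encode("ascii", "xmlcharrefreplace")
```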

Answer 3

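Python has two types of strings: character strings (the unicode type) and byte strings (the str type). The code you pasted operates on byte strings. You need a similar function to handle character strings.

Maybe this: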

def uescape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        if (ord(c) < 32) or (ord(c) > 126):
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)

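I do wonder whether either function is really necessary for you. If it were me, I would choose UTF-8 as the character encoding for the resulting document, process the document in character-string form (without worrying about entities), and perform content.encode('UTF-8') as the final step before delivering it to the client. Depending on the web framework of choice, you may even be able to hand character strings directly to the API and let it figure out how to set the encoding.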
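That last step might look like this. It is a small sketch in Python 3, and the variable names content and payload are only illustrative.

```python
# Work with text everywhere inside the program...
content = "Tam\xe1s Horv\xe1th"

# ...and encode exactly once, at the boundary, just before sending.
payload = content.encode("utf-8")

# The receiving side can recover the same text by decoding with the same codec.
round_trip = payload.decode("utf-8")
```

The document would then also have to declare UTF-8 as its character encoding so the client decodes it with the right codec.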

Answer 4

Whenever I run into this problem, this answer always works for me:

def byteify(input):
    '''
    Removes unicode encodings from the given input string.
    '''
    if isinstance(input, dict):
        return {byteify(key):byteify(value) for key,value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

From: How to get string objects instead of Unicode ones from JSON in Python?

Answer 5

I found a solution on this site:

import sys
reload(sys)
sys.setdefaultencoding("latin-1")

a = u'\xe1'
print str(a) # no exception

Answer 6

Decoding a str makes no sense.

I think you can check for ord(c) > 127 instead.
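A rough sketch of that check, written for Python 3, where the htmlentitydefs module from the question is available as html.entities. The function name escape_non_ascii and the numeric fallback for code points without a named entity are my own assumptions, not part of the original code.

```python
from html.entities import codepoint2name


def escape_non_ascii(text):
    """Replace every non-ASCII character in text with an HTML entity."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code <= 127:
            # Plain ASCII is kept unchanged.
            parts.append(ch)
        elif code in codepoint2name:
            # Named entity, for example &aacute; for U+00E1.
            parts.append("&{};".format(codepoint2name[code]))
        else:
            # Assumed fallback: numeric reference when no name exists.
            parts.append("&#{};".format(code))
    return "".join(parts)


result = escape_non_ascii("Tam\xe1s Horv\xe1th")
```

Because the input is already text, no decode call is needed at all.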

Python throws a UnicodeEncodeError although I am doing str.decode(). Why?
