Unsupported characters in input (Python 2.7.9)

Asked: 2015-01-03 22:01:22

Tags: python-2.7 utf-8

A small question from a beginner. I am trying to write a small function that randomizes the contents of a text.

#-*- coding: utf-8 -*-
import random

def glitch(text):
    new_text = ['']
    for x in text:
        new_text.append(x)
        random.shuffle(new_text)
    return ''.join(new_text)

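As you can see, it is pretty simple, and feeding it a plain string such as 'Hey, how are you?' produces a randomized sentence, as expected. However, when I try to paste in something like this: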

  

print glitch('Iàäï†n$§&0ñŒ≥Q¶µù`o¢y”—œº')

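... Python 2.7.9 reports 'Unsupported characters in input'. I have browsed the forums and tried a few things based on what I understood, since I am new to coding in general, but to no avail.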

Any suggestions?

Thanks.

2 answers:

Answer 0 (score: 0)

#-*- coding: utf-8 -*-
import random

def glitch(text):

    new_text = ['']
    for x in text:
        new_text.append(x)
        random.shuffle(new_text)
    return ''.join(new_text)

print (glitch(u'Iàäï†n$§&0ñŒ≥Q¶µù`o¢y”—œº'))

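This should work. From a quick Google search of my own, I found that you have to put the letter 'u' in front of the string literal to mark the text that follows as unicode.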

Source: Unsupported characters in input
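To make the effect of the 'u' prefix a bit more concrete, here is a minimal sketch (not from the original answer) comparing the Unicode view and the UTF-8 byte view of a short sample. The sample text and the variable names are assumptions chosen for illustration, and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustrative sample only; any non-ASCII text would do.
sample = u'àñŒ'                     # unicode text: 3 code points
encoded = sample.encode('utf-8')    # the same text as UTF-8 bytes

print(len(sample))    # number of code points
print(len(encoded))   # number of bytes (each of these characters takes 2 bytes in UTF-8)
```

A loop such as `for x in text` walks over whatever units the string is made of, so on a Python 2 byte string it visits single bytes and can split a multi-byte character, while on a unicode string it visits whole code points.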

Answer 1 (score: -1)

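Your problem is with Python 2.x in general, not with your specific version of Python 2. In Python 2.x a plain string literal is a byte string, and the default codec for implicit conversions is ascii rather than a Unicode encoding (this changed in Python 3), while your string is (likely) encoded as utf-8. See the following: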

import chardet
text = 'Iàäï†n$§&0ñŒ≥Q¶µù`o¢y”—œº'
print chardet.detect(text)['encoding'] # prints utf-8

If you download Python 3.x, your problem will probably be solved, since strings there are Unicode by default and UTF-8 can handle any Unicode code point.
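As a rough illustration of that last point (a sketch added for clarity, not part of the original answer), the snippet below encodes the string from the question as UTF-8 and decodes it back, then shows that the ascii codec refuses the same text. The variable names are assumptions, and the code is meant to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u'Iàäï†n$§&0ñŒ≥Q¶µù`o¢y”—œº'

# UTF-8 round trip: encode to bytes, decode back to unicode text.
data = text.encode('utf-8')
restored = data.decode('utf-8')
print(restored == text)

# The ascii codec cannot represent the non-ASCII characters.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```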

If you are interested, or for future 2.x users, you can do the following.

import random

def glitch(text):
    new_text = []
    for x in text:
        new_text.append(x)
    random.shuffle(new_text)  # note: shuffle once, not on every iteration
    new_line = ''.join(new_text)  # in Python 2 this is still a byte string: the utf-8 bytes, shuffled
    # joining with u'' instead (`u''.join(new_text)`) would trigger an implicit
    # ascii decode and raise a `UnicodeDecodeError`
    return new_line.decode("ascii", "ignore").encode("utf-8")  # note the "ignore": it bypasses decoding errors
    # now your code will run and return a byte string that is valid utf-8
    # note that you will return a shorter string, because `ascii` only covers
    # 128 characters (bytes 0 to 127), whereas the non-ASCII characters of your
    # `utf-8` string are encoded with bytes in the range 128 to 255, which
    # "ignore" silently drops.
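The function above avoids the error by dropping every non-ASCII character. If the goal is to keep them, one alternative is to decode the bytes as UTF-8 first and shuffle code points instead. The following is only a sketch under that assumption: `glitch_unicode` and its `encoding` parameter are hypothetical names, not part of the original answer, and the code is intended to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import random

def glitch_unicode(text, encoding='utf-8'):
    """Shuffle the code points of text without dropping any (hypothetical helper)."""
    if isinstance(text, bytes):       # Python 2 `str` or Python 3 `bytes`
        text = text.decode(encoding)  # assumes the bytes really are in `encoding`
    chars = list(text)                # one list item per code point
    random.shuffle(chars)             # shuffle once, in place
    return u''.join(chars)

original = u'Iàäï†n$§&0ñŒ≥Q¶µù`o¢y”—œº'
shuffled = glitch_unicode(original.encode('utf-8'))
print(len(shuffled) == len(original))        # nothing was dropped
print(sorted(shuffled) == sorted(original))  # same characters, new order
```

The result is a unicode string; whether it prints cleanly in Python 2 still depends on the encoding of the console it is sent to.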

I hope this helps and makes sense. There is a lot of material online discussing this issue (including several questions of my own, for example :))