Question

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

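I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?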

Answer 1

>>> u'aあä'.encode('ascii', 'ignore')
'a'

Edit

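Decode the string you get back, using either the charset in the appropriate meta tag in the response or the charset in the Content-Type header, and then encode.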

The encode() method accepts other error-handler values besides 'ignore', for example 'replace', 'xmlcharrefreplace' and 'backslashreplace'. See https://docs.python.org/3/library/stdtypes.html#str.encode
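
As a rough illustration of the error handlers listed above, the sketch below encodes the same sample string with each one. It is written for Python 3 (an assumption; the session above is Python 2), so the results are bytes objects rather than str.

```python
# Python 3 sketch: compare the error handlers accepted by str.encode().
sample = "a\u3042\u00e4"  # 'a', HIRAGANA LETTER A, 'a' with diaeresis

results = {
    handler: sample.encode("ascii", handler)
    for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

for handler, encoded in results.items():
    print(handler, encoded)
```

'ignore' silently drops data, while the other three keep a visible trace of each character that could not be encoded.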

Answer 2

As an extension to Ignacio Vazquez-Abrams' answer

>>> u'aあä'.encode('ascii', 'ignore')
'a'

It is sometimes desirable to remove accents from characters and print the base form. This can be accomplished with

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'
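
To see why the NFKD step helps, it can be useful to look at what normalization produces. The sketch below (Python 3, which is an assumption since the session above is Python 2) decomposes a single accented letter and lists the names of the resulting code points.

```python
import unicodedata

# NFKD splits an accented letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFKD", "\u00e4")  # 'a' with diaeresis
names = [unicodedata.name(ch) for ch in decomposed]

# The combining mark is not ASCII, so 'ignore' drops it and keeps the base.
ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
print(names, ascii_only)
```

Characters with no decomposition, such as the hiragana letter in the example above, are still dropped entirely.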

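You may also want to translate other characters, such as punctuation, to their nearest equivalents. For instance, the RIGHT SINGLE QUOTATION MARK Unicode character does not get converted to an ASCII APOSTROPHE when encoding.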

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

There are more efficient ways to accomplish this, though. For more details, see the question Where is Python's "best ASCII for this Unicode" database?
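
One way to handle several punctuation characters at once is a translation table applied before normalization. This is only a minimal sketch in Python 3: the table contents and the names PUNCTUATION_MAP and to_ascii are illustrative choices, not a complete or standard mapping.

```python
import unicodedata

# Hypothetical, deliberately small mapping of "smart" punctuation to ASCII.
PUNCTUATION_MAP = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',  # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',  # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",  # EN DASH
}


def to_ascii(text):
    """Map known punctuation, strip accents, then drop what is left."""
    text = text.translate(PUNCTUATION_MAP)
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii("it\u2019s \u201ccaf\u00e9\u201d"))
```

Anything not covered by the table or by NFKD decomposition is still silently removed, so the table needs to be extended for the input you actually expect.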

Answer 3

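2018 Update:

As of February 2018, using compression such as gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and the Stack Exchange Network sites).
If you do a simple decode like in the original answer on a gzipped response, you will get an error like or similar to this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

In order to decode a gzipped response, you need to import the following modules (in Python 3):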

import gzip
import io

Note: In Python 2 you'd use StringIO instead of io

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

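This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer through a GzipFile object. After that, the decompressed content can be read back into bytes and finally decoded into normal readable text.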
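
The byte 0x8b at position 1 in the error above is the second byte of the gzip magic number (0x1f 0x8b), which suggests a simple check before decompressing. The sketch below is a network-free Python 3 illustration; the helper name maybe_gunzip is hypothetical, and checking the Content-Encoding response header would be an equally reasonable approach.

```python
import gzip
import io


def maybe_gunzip(body):
    """Decompress body if it starts with the gzip magic bytes, else return it."""
    if body[:2] == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body


# Simulate a compressed response body instead of fetching a real URL.
compressed = gzip.compress("h\u00e9llo".encode("utf-8"))
text = maybe_gunzip(compressed).decode("utf-8")
print(text)
```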

Original answer from 2010:

Can we get the actual value used for link?

In addition, we usually encounter this problem when we try to .encode() an already encoded byte string. So you might try to decode it first, as in

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

while:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

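succeeds without error. Do note that "windows-1252" is just something I used as an example. I got it from chardet, which had 0.5 confidence that it is right! (Well, given a string that is 1 character long, what do you expect?) You should change it to the actual encoding of the byte string returned by .urlopen().read(), whatever applies to the content you retrieved.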
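
When the real encoding is unknown, one pragmatic option is to try a short list of candidates in order. The following is only a sketch in Python 3: the helper name decode_with_fallback and the candidate order are assumptions, and a wrong candidate can still decode without error and yield the wrong characters.

```python
def decode_with_fallback(raw, encodings=("utf-8", "windows-1252")):
    """Return (text, encoding_used), trying each candidate encoding in turn."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_with_fallback(b"\xa0")
utf8_bytes = text.encode("utf-8")
print(repr(text), used, utf8_bytes)
```

Strict UTF-8 goes first because it rejects most non-UTF-8 input, which makes a false match less likely than with single-byte encodings.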

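Another problem I see is that the .encode() string method returns the modified string and does not modify the source in place. So calling self.response.out.write(html) is of little use, because html is not the encoded string returned by html.encode (if that is what you were originally aiming for).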
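
The point about the return value can be shown in a few lines. This sketch uses Python 3 (an assumption, since the question is Python 2), but strings are immutable in both versions, so the behaviour is the same.

```python
text = "ni\u00f1era"

# The result of this call is thrown away; text itself is not changed.
text.encode("ascii", "ignore")

# To use the encoded form, bind the return value to a name.
encoded = text.encode("ascii", "ignore")
print(repr(text), encoded)
```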

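As Ignacio suggested, check the source web page for the actual encoding of the string returned by read(). It is either in one of the meta tags or in the Content-Type header of the response. Then use that as the argument to .decode().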
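
A rough way to automate that lookup is sketched below in Python 3. The function name guess_charset, the regular expression, and the 4096-byte scan window are all illustrative assumptions; the meta-tag pattern is deliberately loose and is not a full HTML parser.

```python
import re
from email.message import Message

# Loose pattern for <meta charset="..."> and <meta ... content="...; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def guess_charset(content_type, body, default="utf-8"):
    """Prefer the Content-Type header, then a meta tag, then a default."""
    if content_type:
        header = Message()
        header["Content-Type"] = content_type
        charset = header.get_content_charset()
        if charset:
            return charset
    match = META_CHARSET.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


print(guess_charset("text/html; charset=ISO-8859-1", b""))
```

The returned name can then be passed to .decode() on the raw bytes, keeping in mind that the declaration itself may be wrong.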

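Please note, however, that you should not assume other developers are responsible enough to make sure the header and/or meta character set declarations match the actual content. (Which is a PITA, yeah, I should know, I was one of those before.)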

Answer 4

Use unidecode - it converts even weird characters to ascii instantly, and even converts Chinese to phonetic ascii.

$ pip install unidecode

Then:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

Answer 5

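I use this helper function in all of my projects. If it can't convert the unicode, it ignores it. It ties into a Django library, but with a little research you could bypass that.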

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any unicode errors after using this.
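
For readers without Django, a standard-library-only approximation might look like the sketch below. It is written for Python 3 and only mimics the behaviour shown in the docstring above; it is not a reimplementation of Django's smart_str, and the UTF-8 assumption for byte input is mine.

```python
def convert_unicode_to_string(x):
    """Return an ASCII-only str, dropping characters that cannot be encoded."""
    if isinstance(x, bytes):
        # Assumption: byte input is UTF-8; undecodable bytes are dropped.
        x = x.decode("utf-8", "ignore")
    return str(x).encode("ascii", "ignore").decode("ascii")


print(convert_unicode_to_string("ni\u00f1era"))
```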

Answer 6

For broken consoles like cmd.exe, and for HTML output, you can always use:

my_unicode_string.encode('ascii','xmlcharrefreplace')

This preserves all the non-ascii characters while making them printable in pure ASCII and in HTML.
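
To illustrate that nothing is lost, the sketch below (Python 3, an assumption) encodes a sample string with 'xmlcharrefreplace' and then recovers the original text with html.unescape from the standard library.

```python
import html

original = "\u5317\u4eac \u0160koda"  # sample text: two Chinese characters and 'Skoda' with a caron

# Every non-ASCII character becomes a numeric character reference.
ascii_bytes = original.encode("ascii", "xmlcharrefreplace")

# A browser, or html.unescape, turns the references back into characters.
restored = html.unescape(ascii_bytes.decode("ascii"))
print(ascii_bytes, restored == original)
```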

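WARNING: If you use this in production code to avoid errors, then most likely there is something wrong in your code. The only valid use cases are printing to a non-unicode console or easy conversion to HTML entities in an HTML context.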

Finally, if you are on Windows and use cmd.exe, you can type chcp 65001 to enable utf-8 output (works with the Lucida Console font). You might need to add myUnicodeString.encode('utf8').

Answer 7

You wrote """I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere."""

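The HTML is NOT expected to contain any kind of "attempt at unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually declared up front ... look for "charset".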

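You appear to be assuming that the charset is UTF-8 ... on what grounds? The "\xA0" byte shown in your error message indicates that you may have a single-byte charset, e.g. cp1252.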

If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.

Why have you tagged your question with "regex"?

Update after you replaced your whole question with a non-question:

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.

html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using 
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't 
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object 
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)
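
The last comment, about latin1 versus cp1252, can be made concrete with a small Python 3 sketch (the sample bytes are an assumption chosen for illustration). Decoding as latin-1 never raises, which is exactly why it can hide a wrong guess.

```python
raw = b"\x93quoted\x94"  # bytes typical of Windows "smart quotes"

# latin-1 accepts every byte, but here it yields invisible C1 control characters.
as_latin1 = raw.decode("latin-1")

# cp1252 maps the same bytes to the curly quotes that were probably intended.
as_cp1252 = raw.decode("cp1252")

# Every possible byte value decodes under latin-1, so an error never occurs.
latin1_never_fails = len(bytes(range(256)).decode("latin-1")) == 256
print(repr(as_latin1), repr(as_cp1252), latin1_never_fails)
```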

Answer 8

If you have a string line, you can use the string method .encode([encoding], [errors='strict']) to convert it to another encoding.

line = 'my big string'

line.encode('ascii', 'ignore')

For more information about handling ASCII and unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html

Answer 9

I think the answer is there, but only in bits and pieces, which makes it difficult to quickly fix a problem such as

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

Let's take an example. Suppose I have a file with some data in the following form (containing ascii and non-ascii characters)

1/10/17, 21:36 - Land : Welcome ��

and we want to ignore the non-ascii characters and keep only the ascii ones.

This code will do:

import unicodedata
fp  = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline

and type(rline) will give you

>>> type(rline)
<type 'str'>
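
The snippet above is Python 2 (unicode() and the print statement). A possible Python 3 equivalent is sketched below; the generator name ascii_lines is hypothetical, and it accepts any iterable of text lines, for example a file opened with open(path, encoding="utf-8", errors="replace").

```python
import unicodedata


def ascii_lines(lines):
    """Yield each non-empty line reduced to its ASCII characters."""
    for line in lines:
        line = unicodedata.normalize("NFKD", line)
        line = line.encode("ascii", "ignore").decode("ascii").strip()
        if line:
            yield line


sample = ["1/10/17, 21:36 - Land : Welcome \ufffd\ufffd\n", "\n", "caf\u00e9\n"]
cleaned = list(ascii_lines(sample))
print(cleaned)
```

Stripping after the encode step also removes whitespace left behind when trailing non-ASCII characters are dropped.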

Answer 10

unicodestring = '\xa0'

decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')

worked for me

Answer 11

It looks like you are using Python 2.x. Python 2.x defaults to ascii and doesn't know about Unicode. Hence the exception.

Just paste the line below after the shebang, and it will work

# -*- coding: utf-8 -*-

Convert Unicode to ASCII without errors in Python
