Question

I am using Python-2.6 CGI scripts, but I found this error in the server log while calling json.dumps():

Traceback (most recent call last):
  File "/etc/mongodb/server/cgi-bin/getstats.py", line 135, in <module>
    print json.dumps(__getdata())
  File "/usr/lib/python2.7/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
    return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte

Here, the __getdata() function returns a dictionary {}.

Before posting this question, I referred to this question.

Update

The following line is what breaks the JSON encoder:

now = datetime.datetime.now()
now = datetime.datetime.strftime(now, '%Y-%m-%dT%H:%M:%S.%fZ')
print json.dumps({'current_time': now})  # this is the culprit

I found a temporary fix:

print json.dumps( {'old_time': now.encode('ISO-8859-1').strip() })

But I am not sure whether this is the correct way to do it.

Answer 1

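The error occurs because the dictionary contains some non-ASCII characters that cannot be encoded/decoded. A simple way to avoid this error is to encode such strings with the encode() function, as follows (where a is the string containing non-ASCII characters):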

a.encode('utf-8').strip()
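
In Python 3 terms, the underlying idea (make sure every string handed to json.dumps() is real text) might be sketched as below. This is only a sketch, not the asker's code: `decode_bytes` and its `source_encoding` parameter are hypothetical names, and ISO-8859-1 is merely an assumed source encoding. Note that it decodes byte strings instead of calling encode() as above, and that Python 3's json.dumps() rejects bytes with a TypeError rather than a UnicodeDecodeError.

```python
import json


def decode_bytes(value, source_encoding="iso-8859-1"):
    """Recursively turn bytes into str so json.dumps() can serialize them.

    `source_encoding` is an assumption; use whatever the data really is.
    """
    if isinstance(value, bytes):
        return value.decode(source_encoding)
    if isinstance(value, dict):
        return {
            decode_bytes(key, source_encoding): decode_bytes(item, source_encoding)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [decode_bytes(item, source_encoding) for item in value]
    return value


# Hypothetical data mixing bytes and text.
data = {"price": b"\xa5 100", "tags": [b"a", "b"]}
encoded = json.dumps(decode_bytes(data))
```

Decoding once at the boundary, and keeping everything as text afterwards, avoids sprinkling encode()/decode() calls around each json.dumps() call.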

Answer 2

Try the following code snippet:

with open(path, 'rb') as f:
  text = f.read()

Answer 3

I fixed this simply by specifying a different codec in the read_csv() command:

encoding = 'unicode_escape'

Answer 4

Your string contains a non-ASCII character.

If the data was produced with some other encoding, it may not be decodable as utf-8. For example:

>>> 'my weird character \x96'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 19: invalid start byte

In this case the encoding is windows-1252, so you have to do:

>>> 'my weird character \x96'.decode('windows-1252')
u'my weird character \u2013'

Now that you have Unicode, you can safely encode it to utf-8.
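
For readers on Python 3, where the session above would start from a bytes object, the same round trip might look like the sketch below. The variable names are illustrative only.

```python
raw = b"my weird character \x96"  # cp1252 bytes, not valid UTF-8

# Decoding with the wrong codec reproduces the error from the traceback.
try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec the bytes were written in yields real text...
text = raw.decode("windows-1252")

# ...which can then be encoded to UTF-8 without trouble.
utf8_bytes = text.encode("utf-8")
```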

Answer 5

This solution worked for me:

extras = getIntent().getExtras();
String a = extras.getString("KEY");

if (a.equals("A_CLASS")) {
    class_a_extras = getIntent();
}

Answer 6

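Set the default encoding at the top of your code:

# Python 2 only: sys.setdefaultencoding() does not exist in Python 3
import sys
reload(sys)
sys.setdefaultencoding("ISO-8859-1")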

Answer 7

Inspired by aaronpenne and Soumyaansh:

f    = open("file.txt","rb")
text = f.read().decode(errors='replace')

Answer 8

As of 2018-05, this is handled directly with decode, at least for Python 3.

I used the snippet below after getting invalid start byte and invalid continuation byte errors. Adding errors='ignore' fixed it for me.

with open(out_file, 'rb') as f:
    for line in f:
        print(line.decode(errors='ignore'))
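
To make the trade-off between the two error handlers concrete, here is a small Python 3 sketch on made-up bytes. Both handlers silence the exception, but both also lose the original characters, so they are probably best treated as a last resort when the real encoding is unknown.

```python
raw = b"caf\xe9 \x96 ok"  # Latin-1/cp1252 bytes, not valid UTF-8

# 'replace' substitutes U+FFFD for every undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# 'ignore' silently drops every undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")
```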

Answer 9

When reading the csv, I added an encoding argument:

import pandas as pd
dataset = pd.read_csv('sample_data.csv',header=0,encoding = 'unicode_escape')

Answer 10

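The following snippet worked for me.

import pandas as pd
# error_bad_lines=False skips malformed lines instead of raising
# (newer pandas versions replace it with on_bad_lines='skip')
df = pd.read_csv(filename, sep = ';', encoding = 'latin1', error_bad_lines=False)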
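
When the encoding is not known in advance, one option is to try a few candidates in order. The sketch below uses only the standard library rather than pandas; `read_csv_rows` and its default candidate list are hypothetical, and it takes raw bytes (for example from open(path, 'rb').read()). Because latin-1 can decode any byte sequence, a successful decode does not prove the guess was right.

```python
import csv
import io


def read_csv_rows(raw_bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode raw CSV bytes with the first encoding that works, then parse.

    Returns (rows, encoding_used). The candidate list is an assumption.
    """
    for encoding in encodings:
        try:
            text = raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
        return list(reader), encoding
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up semicolon-separated data containing cp1252 bytes.
rows, used = read_csv_rows(b"name;price\ncaf\xe9;\xa5100\n")
```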

Answer 11

The following line is what breaks the JSON encoder:

now = datetime.datetime.now()
now = datetime.datetime.strftime(now, '%Y-%m-%dT%H:%M:%S.%fZ')
print json.dumps({'current_time': now})  # this is the culprit

I found a temporary fix:

print json.dumps( {'old_time': now.encode('ISO-8859-1').strip() })

Marking this as a temporary fix (not sure it is correct).

Answer 12

Simple solution:

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

Answer 13

If the methods above do not work for you, you may want to consider changing the encoding of the csv file itself.

Using Excel:

1. Open csv file using Excel
2. Navigate to "File menu" option and click "Save As"
3. Click "Browse" to select a location to save the file
4. Enter intended filename
5. Select CSV (Comma delimited) (*.csv) option
6. Click "Tools" drop-down box and click "Web Options"
7. Under "Encoding" tab, select the option Unicode (UTF-8) from "Save this document as" drop-down list
8. Save the file

Using Notepad:

1. Open csv file using notepad
2. Navigate to "File" > "Save As" option
3. Next, select the location to the file
4. Select the Save as type option as All Files(*.*)
5. Specify the file name with .csv extension
6. From "Encoding" drop-down list, select UTF-8 option.
7. Click Save to save the file

After doing this, you should be able to import the csv file without running into the UnicodeDecodeError.
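
The same conversion can also be scripted instead of done by hand. The following is a minimal Python 3 sketch, assuming the original encoding is known (cp1252 is used purely for illustration); `convert_to_utf8` and the file names are hypothetical.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, source_encoding):
    """Rewrite src_path as UTF-8 at dst_path; source_encoding must be known."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demonstration with a throwaway cp1252 file.
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "legacy.csv")
dst_file = os.path.join(workdir, "utf8.csv")
with open(src_file, "wb") as handle:
    handle.write(b"item,price\r\ncaf\xe9,\xa35\r\n")

convert_to_utf8(src_file, dst_file, "cp1252")
with open(dst_file, "rb") as handle:
    converted = handle.read()
```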

Answer 14

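You can use any standard encoding that suits your specific usage and input.

"utf-8" is the default.

"iso8859-1" is also popular for Western Europe.

For example: bytes_obj.decode('iso8859-1')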

请参阅： https://docs.python.org/3/library/codecs.html#standard-encodings
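
As a quick illustration of why the choice of codec matters, the Python 3 sketch below decodes the same byte with two single-byte encodings. Neither call raises, yet the results differ, so picking a codec only because it stops the exception can silently produce the wrong character.

```python
byte = b"\x96"

# ISO-8859-1 maps every byte to the code point with the same number,
# so 0x96 becomes an invisible C1 control character.
as_latin1 = byte.decode("iso8859-1")

# Windows-1252 assigns 0x96 to the en dash.
as_cp1252 = byte.decode("cp1252")

# The byte from the question's traceback is the yen sign in ISO-8859-1.
yen = b"\xa5".decode("iso8859-1")
```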

Answer 15

Instead of looking for ways to decode a5 (yen ¥) or 96 (en dash –), tell MySQL that your client is encoded in "latin1" but that you want "utf8" in the database.

See the details in Trouble with UTF-8 characters; what I see is not what I stored

Answer 16

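If you still get the same error after trying all the workarounds above, you can try exporting the file as CSV (a second time, if it already is one). Especially if you are using scikit-learn, it is best to import the dataset as a CSV file.

I spent hours on this, and the solution was this simple. Export the file as CSV to the directory where Anaconda or your classifier tools are installed and try again.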

Answer 17

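In general,

Python throws this kind of error when you try to read a file of an unsuitable type (for example, a binary file) as if it were text.

e.g.

file = open("xyz.pkl", "r")
text = file.read()

The second line will throw the error above:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Reading a .npy file in a similar way will also throw this kind of error.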
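
A minimal Python 3 sketch of this failure and one way around it, assuming the file really is a pickle: open it in binary mode and hand it to the matching loader. The temporary path and the sample object are made up for the demonstration.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), "xyz.pkl")  # throwaway file
with open(path, "wb") as handle:
    pickle.dump({"answer": 42}, handle)

# Text mode tries to decode the pickle's binary header as UTF-8 and fails.
try:
    with open(path, "r", encoding="utf-8") as handle:
        handle.read()
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True

# Binary mode plus the matching loader reads the object back.
with open(path, "rb") as handle:
    restored = pickle.load(handle)
```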

Answer 18

In my case, I had to save the file as UTF8 with BOM rather than as plain UTF8, and then the error was gone.
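
On the Python side, files saved as "UTF-8 with BOM" correspond to the utf-8-sig codec. The sketch below, which uses a made-up payload, shows how the two codecs treat the leading byte order mark. It illustrates the codec's behaviour only and does not explain why the BOM made the error above go away.

```python
# What an editor writes for "UTF-8 with BOM": EF BB BF, then the text.
payload = "name,value\n".encode("utf-8-sig")

# Plain utf-8 keeps the BOM as a leading U+FEFF character.
plain = payload.decode("utf-8")

# utf-8-sig strips the BOM, and also accepts input that has none.
stripped = payload.decode("utf-8-sig")
no_bom = b"name,value\n".decode("utf-8-sig")
```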

Answer 19

from io import BytesIO
import pandas as pd

df = pd.read_excel(BytesIO(bytes_content), engine='openpyxl')

This worked for me.

Answer 20

In my case, the error occurs if I save the xlsx Excel file as CSV (Comma delimited). However, when I save it as CSV (MS-DOS), the error does not appear.

Answer 21

Hi there. First, to load the "GoogleNews-vectors-negative300.bin.gz" file, extract it in Ubuntu with this command: gunzip -k GoogleNews-vectors-negative300.bin.gz. [Manual extraction is never recommended.]

Second, run these commands in Python 3:

import gensim
model = gensim.models.Word2Vec.load_word2vec_format('./model/GoogleNews-vectors-negative300.bin', binary=True)

I hope it will be useful.
