Question

I converted an Excel file to CSV so that I can analyze the dataset with Python. After importing the modules and loading the dataset with this code

import pandas as pd
import numpy as np
import matplotlib as mlt

pd.read_csv('filename.csv')

I get the following message:

"'utf-8' codec can't decode byte 0xbf in position 6: invalid start byte"

I searched online, but none of the solutions I found worked for my problem, and honestly I don't know what to do.

Answer 1

First, you need to know the file's actual character encoding. It is not UTF-8.

There are many different character encodings, and Excel sometimes changes the encoding to 'iso-8859-1' or 'cp1252', which is maddening.
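To see where the message comes from, here is a minimal sketch that reproduces it without any file. It assumes the data was written as cp1252 and contains the character "¿", which cp1252 stores as the single byte 0xbf. In UTF-8 that byte is only valid as a continuation byte, so the decoder rejects it as the start of a character. The sample text is made up for illustration.

```python
# Hypothetical sample: the text "¿Qué?" encoded the way cp1252 would store it.
raw = "\u00bfQu\u00e9?".encode("cp1252")
print(raw)  # the bytes as they would sit on disk

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # Same kind of message that pd.read_csv surfaces.
    message = str(exc)
    print(message)

# Decoding with the encoding that was actually used succeeds.
text = raw.decode("cp1252")
```

The position in the message is the byte offset of the first offending byte, so in the question's file the problem starts at the seventh byte.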

Here is something everyone working in IT should read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

To solve your problem, you have at least three options:

1) Try a few likely encodings (latin1, cp1252, etc.):

df = pd.read_csv('file.csv', encoding='latin1')
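If you would rather not edit the call by hand each time, the trial can be put in a loop. This is only a sketch: the helper name first_working_encoding and the candidate order are my own choices, not something pandas provides. Note that latin1 maps every possible byte to a character, so it never raises; a successful decode there only means "no error", not "correct text", which is why it goes last.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin1")):
    """Return the first candidate that decodes `raw` without an error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next one
        return name
    return None


# Hypothetical sample bytes containing 0xbf, as in the question's error message.
sample = b"id;name\n1;\xbfQu\xe9?\n"
print(first_working_encoding(sample))
```

In practice you would read the file in binary mode, pass the bytes to the helper, and hand the returned name to pd.read_csv(..., encoding=name). Always look at a few rows afterwards to check that accented characters came out right.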

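2) Before reading the file, save it with UTF-8 encoding (or in its original format). Windows may change the encoding after you open the file in Excel and update some rows.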
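The re-save can also be done from Python instead of Excel. The sketch below assumes you already know (or have guessed) that the source is cp1252; the function name convert_to_utf8 and the file names are hypothetical. It uses newline="" so that line endings are copied through unchanged.

```python
import os
import tempfile


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Re-encode a text file to UTF-8, leaving line endings untouched."""
    with open(src, encoding=source_encoding, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(fin.read())


# Demo on a throwaway file so the example does not depend on real data.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "legacy.csv")
    dst = os.path.join(tmp, "utf8.csv")
    with open(src, "wb") as f:
        f.write(b"name\n\xbfQu\xe9?\n")  # cp1252 bytes
    convert_to_utf8(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()

print(converted)
```

After the conversion, pd.read_csv(dst) should work with its default UTF-8 decoding, provided the source encoding you passed was the right one.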

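3) One way to figure this out is to try a bunch of different character encodings and see whether any of them work. A better way, though, is to use the chardet module to try to guess the correct encoding automatically. It is not 100% guaranteed to be right, but it is usually faster than guessing by hand: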

import chardet
import pandas as pd

# look at the first ten thousand bytes to guess the character encoding
with open('file.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

# example output:
# {'encoding': 'Windows-1252', 'confidence': 0.99, 'language': ''}

# read in the file with the encoding detected by chardet
df = pd.read_csv('file.csv', encoding='Windows-1252')
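Since chardet only guesses, it can help to check the reported confidence before trusting the result. The following is a rough sketch, not a recommended recipe: the function name guess_encoding, the 0.8 threshold, and the latin1 fallback are all assumptions of mine. It also falls back when chardet is not installed, purely so the example runs anywhere.

```python
def guess_encoding(path, sample_size=10000, min_confidence=0.8, fallback="latin1"):
    """Return chardet's guess if it is confident enough, otherwise `fallback`."""
    with open(path, "rb") as f:
        raw = f.read(sample_size)  # a sample is usually enough for a guess

    try:
        import chardet  # third-party package, may not be installed
    except ImportError:
        return fallback

    result = chardet.detect(raw)  # dict with 'encoding' and 'confidence' keys
    if result["encoding"] is None or result["confidence"] < min_confidence:
        return fallback
    return result["encoding"]
```

You could then call pd.read_csv('file.csv', encoding=guess_encoding('file.csv')). If the guess falls back to latin1 the read will not fail, but the text may still be wrong, so inspect the result.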
