I am trying to read a dataset into df1, but it doesn't work:
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";")
df1.head()
The code above produces a huge traceback; this is the most relevant part:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 18: invalid start byte
Answer 0 (score: 5)
The data is indeed not encoded as UTF-8; everything is ASCII except for a single 0x92 byte:
b'Korea, Dem. People\x92s Rep.'
Decode it as Windows codepage 1252 instead, where 0x92 is a fancy quote, ’:
df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
sep=";", encoding='cp1252')
Demo:
>>> import pandas as pd
>>> df1 = pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",
... sep=";", encoding='cp1252')
>>> df1.head()
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 \
0 Afghanistan 55.1 55.5 55.9 56.2 56.6 57.0 57.4 57.8 58.2 58.6
1 Albania 74.3 74.7 75.2 75.5 75.8 76.1 76.3 76.5 76.7 76.8
2 Algeria 70.2 70.6 71.0 71.4 71.8 72.2 72.6 72.9 73.2 73.5
3 American Samoa .. .. .. .. .. .. .. .. .. ..
4 Andorra .. .. .. .. .. .. .. .. .. ..
2010 2011 2012 2013 Unnamed: 15 2014 2015
0 59.0 59.3 59.7 60.0 NaN 60.4 60.7
1 77.0 77.2 77.4 77.6 NaN 77.8 78.0
2 73.8 74.1 74.3 74.6 NaN 74.8 75.0
3 .. .. .. .. NaN .. ..
4 .. .. .. .. NaN .. ..
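The claim that only one byte falls outside ASCII can be checked on the raw bytes before involving pandas at all. The following is a minimal sketch, not part of the original answer: `non_ascii_bytes` is a hypothetical helper, and `sample` is just the row quoted above rather than the whole file.

```python
def non_ascii_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) pairs for every byte outside the ASCII range."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]


# Assumed sample: the single row quoted in the answer, not the full CSV.
sample = b"Korea, Dem. People\x92s Rep."

offenders = non_ascii_bytes(sample)
print([(i, hex(b)) for i, b in offenders])  # [(18, '0x92')]

# With only 0x92 to account for, cp1252 maps it to the right single quote.
print(sample.decode("cp1252"))
```

For a real file you would read it in binary mode (`open(path, "rb").read()`) and pass the result to the same helper.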
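However, I notice that Pandas also seems to take the HTTP headers at face value, and produces Mojibake when you load the data from a URL. When I save the data directly to disk and then load it with pd.read_csv(), the data is decoded correctly, but loading from the URL produces re-coded data: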
>>> df1[' '][102]
'Korea, Dem. Peopleâ€™s Rep.'
>>> df1[' '][102].encode('cp1252').decode('utf8')
'Korea, Dem. People’s Rep.'
This is a known bug in Pandas. You can work around it by using urllib.request to load the URL and passing the response to pd.read_csv():
>>> import urllib.request
>>> with urllib.request.urlopen("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv") as resp:
... df1 = pd.read_csv(resp, sep=";", encoding='cp1252')
...
>>> df1[' '][102]
'Korea, Dem. People’s Rep.'
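The re-coding described above can be reproduced without a network connection. This is a sketch under the assumption that the damage comes from the quote being written out as UTF-8 and then decoded again as cp1252; it does not reproduce the pandas code path itself.

```python
# The correctly decoded value, with U+2019 RIGHT SINGLE QUOTATION MARK.
original = "Korea, Dem. People\u2019s Rep."

# Encode as UTF-8, then (wrongly) decode those bytes as cp1252.
# The three UTF-8 bytes E2 80 99 turn into three separate characters.
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)

# Reversing the two steps recovers the original text.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == original)
```

The same `.encode('cp1252').decode('utf8')` round trip is what the interactive session above uses to confirm the diagnosis.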
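Answer 1 (score: 0)
In my case, I got the UnicodeDecodeError because I was parsing, on a Windows machine, a CSV file that had been created on Mac OS. To get rid of the error, try passing encoding='mac_roman' to the pandas read_csv method.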
import pandas as pd
df1=pd.read_csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/World_Life_Expectancy.csv",sep=";", encoding='mac_roman')
df1.head()
Output:
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Unnamed: 15 2014 2015
0 Afghanistan 55.1 55.5 55.9 56.2 56.6 57.0 57.4 57.8 58.2 58.6 59.0 59.3 59.7 60.0 NaN 60.4 60.7
1 Albania 74.3 74.7 75.2 75.5 75.8 76.1 76.3 76.5 76.7 76.8 77.0 77.2 77.4 77.6 NaN 77.8 78.0
2 Algeria 70.2 70.6 71.0 71.4 71.8 72.2 72.6 72.9 73.2 73.5 73.8 74.1 74.3 74.6 NaN 74.8 75.0
3 American Samoa .. .. .. .. .. .. .. .. .. .. .. .. .. .. NaN .. ..
4 Andorra .. .. .. .. .. .. .. .. .. .. .. .. .. .. NaN .. ..
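A codec that decodes without raising is not necessarily the one the file was written in, so it is worth checking what the problem byte turns into. A small comparison, assuming the row quoted in the first answer as the sample:

```python
# Assumed sample: the one row known to contain the 0x92 byte.
sample = b"Korea, Dem. People\x92s Rep."

as_cp1252 = sample.decode("cp1252")        # 0x92 -> U+2019 (right single quote)
as_mac_roman = sample.decode("mac_roman")  # 0x92 -> U+00ED (i with acute accent)

for codec, text in (("cp1252", as_cp1252), ("mac_roman", as_mac_roman)):
    print(f"{codec:10} {text!r}")
```

Both calls succeed, so the choice between them has to come from inspecting the decoded text, not from the absence of an exception.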
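Answer 2 (score: 0)
This problem occurs because your file contains some characters that the codec does not recognize. For example, a file that you are reading as utf-8 contains some Windows 1250 characters. You should delete or replace those characters to solve the problem.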
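Deleting or replacing the offending byte might look like the sketch below. It works on an assumed one-row sample using only the standard `bytes` methods; as far as I know, recent pandas versions (1.3 and later) also accept an `encoding_errors` argument to `read_csv` that applies the same error handlers.

```python
# Assumed sample: the row containing the byte that UTF-8 rejects.
sample = b"Korea, Dem. People\x92s Rep."

# Replace undecodable bytes with U+FFFD, the Unicode replacement character.
replaced = sample.decode("utf-8", errors="replace")

# Drop undecodable bytes entirely.
dropped = sample.decode("utf-8", errors="ignore")

# Or patch the known byte by hand before decoding (here: a plain apostrophe).
patched = sample.replace(b"\x92", b"'").decode("utf-8")

print(repr(replaced))
print(repr(dropped))
print(repr(patched))
```

All three variants lose or alter the original character, so they are a fallback for when the real encoding cannot be determined.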