I have the following code, but Python 3 does not seem to recognize the vertical pipe as a Unicode character.
import pandas as pd

# Column names for the first five fields of u.item
m_cols = ['movie_id', 'title', 'release_date',
          'video_release_date', 'imdb_url']
movies = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.item',
    sep='|', names=m_cols, usecols=range(5))
movies.head()
I get the following error:
UnicodeDecodeError Traceback (most recent call last)
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens (pandas\_libs\parsers.c:14858)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype (pandas\_libs\parsers.c:17119)()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._string_convert (pandas\_libs\parsers.c:17347)()
pandas\_libs\parsers.pyx in pandas._libs.parsers._string_box_utf8 (pandas\_libs\parsers.c:23041)()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte
During handling of the above exception, another exception occurred:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-15-72a8222212c1> in <module>()
4 movies = pd.read_csv(
5 'http://files.grouplens.org/datasets/movielens/ml-100k/u.item',
----> 6 sep='|', names=m_cols, usecols=range(5))
7
8 movies.head()
What could be causing this, and how can I fix it?
Answer 0 (score: 1)
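The pipe separator is not the problem. The traceback shows the parser failing to decode byte 0xe9 as UTF-8, which suggests the file is not UTF-8 encoded (0xe9 is "é" in Latin-1). In Python 3, pass encoding="latin-1" to read_csv: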
In [9]: movies = pd.read_csv(
   ...:     'http://files.grouplens.org/datasets/movielens/ml-100k/u.item',
   ...:     sep='|', names=m_cols, usecols=range(5), header=None, encoding="latin-1")
In [10]: movies.head()
Out[10]:
movie_id title release_date video_release_date \
0 1 Toy Story (1995) 01-Jan-1995 NaN
1 2 GoldenEye (1995) 01-Jan-1995 NaN
2 3 Four Rooms (1995) 01-Jan-1995 NaN
3 4 Get Shorty (1995) 01-Jan-1995 NaN
4 5 Copycat (1995) 01-Jan-1995 NaN
imdb_url
0 http://us.imdb.com/M/title-exact?Toy%20Story%2...
1 http://us.imdb.com/M/title-exact?GoldenEye%20(...
2 http://us.imdb.com/M/title-exact?Four%20Rooms%...
3 http://us.imdb.com/M/title-exact?Get%20Shorty%...
4 http://us.imdb.com/M/title-exact?Copycat%20(1995)
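
To see why the encoding matters, here is a minimal sketch that reproduces the same error without pandas or a download. The sample bytes are a made-up title (an assumption for illustration, not a row copied from u.item) containing the byte 0xe9 named in the traceback:

```python
# Hypothetical sample: a title as it might be stored in a Latin-1 file.
# 0xe9 is "é" in Latin-1; in UTF-8 it starts a multi-byte sequence
# and must be followed by continuation bytes.
raw = b"Les Mis\xe9rables (1995)"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    failing_byte = raw[exc.start]  # the byte the UTF-8 decoder rejected
    print(exc)

# Latin-1 maps every byte value to a character, so this decode cannot fail.
decoded = raw.decode("latin-1")
print(decoded)
```

Because Latin-1 accepts every byte value, decoding with it never raises, but the text only comes out right if the file really is Latin-1 (or a close relative). If the titles look garbled after loading, another encoding such as "cp1252" may be worth trying.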