How to get rid of useless data in an open dataset

Time: 2017-07-12 04:20:12

Tags: python pandas

I'm working with an open dataset I found, specifically this one: http://files.grouplens.org/datasets/movielens/ml-100k/u.item. I'm trying to parse it, and this is how I load it into pandas:

import pandas as pd

movie_cols = ['movie_id', 'title', 'release_date', 'imdb_url']
movies = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.item', sep='|', names=movie_cols)

When I then run

movies.head()

it shows this:

1 answer:

Answer 0: (score: 1)

Use the usecols parameter of read_csv to keep only the 1st, 2nd, 3rd and 5th columns (zero-based positions 0, 1, 2 and 4):

import pandas as pd

movie_cols = ['movie_id', 'title', 'release_date', 'imdb_url']
# usecols keeps columns 0, 1, 2 and 4; names are assigned to those columns in order
movies = pd.read_csv('http://files.grouplens.org/datasets/movielens/ml-100k/u.item',
                     sep='|',
                     names=movie_cols,
                     encoding='latin-1',
                     usecols=[0, 1, 2, 4])
print(movies.head())
   movie_id              title release_date  \
0         1   Toy Story (1995)  01-Jan-1995   
1         2   GoldenEye (1995)  01-Jan-1995   
2         3  Four Rooms (1995)  01-Jan-1995   
3         4  Get Shorty (1995)  01-Jan-1995   
4         5     Copycat (1995)  01-Jan-1995   

                                            imdb_url  
0  http://us.imdb.com/M/title-exact?Toy%20Story%2...  
1  http://us.imdb.com/M/title-exact?GoldenEye%20(...  
2  http://us.imdb.com/M/title-exact?Four%20Rooms%...  
3  http://us.imdb.com/M/title-exact?Get%20Shorty%...  
4  http://us.imdb.com/M/title-exact?Copycat%20(1995)
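
To try the same idea without downloading anything, here is a minimal sketch on made-up data. The two rows and the example.com URLs are assumptions for illustration only; they just mimic the pipe-delimited layout, shortened to seven fields per row with an empty 4th field. The read_csv arguments are the same ones used above, plus header=None to state explicitly that the data has no header row.

```python
import io

import pandas as pd

# Made-up sample (not taken from u.item): seven pipe-separated fields per row,
# with the 4th field left empty and two trailing flag columns we do not want.
sample = io.StringIO(
    "1|Toy Story (1995)|01-Jan-1995||http://example.com/toy-story|0|0\n"
    "2|GoldenEye (1995)|01-Jan-1995||http://example.com/goldeneye|0|1\n"
)

movie_cols = ['movie_id', 'title', 'release_date', 'imdb_url']

# usecols picks the columns by zero-based position; because names has the
# same length as usecols, each name is assigned to one of the kept columns.
sample_movies = pd.read_csv(sample,
                            sep='|',
                            header=None,
                            names=movie_cols,
                            usecols=[0, 1, 2, 4])

print(sample_movies)
```

Only the four named columns end up in the DataFrame; the empty 4th field and the trailing flag columns are never loaded.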