计算DataFrame中每部电影的平均评分时,如何放置NaN?

时间:2020-03-26 09:31:53

标签: python pandas data-science

我正在使用MovieLens数据集,基本上有2个文件,一个.csv文件包含电影,另一个.csv文件包含n个用户对特定电影的评级。

为了获得DF中每部电影的平均评分,我执行了以下操作。

ratings_data.groupby('movieId').rating.mean()

但是使用该代码,我得到了9724部电影,而在主DataFrame中却获得了9742部电影。
我认为有些电影根本没有评级,但是由于我想将评级添加到主要电影数据集中,如何将NaN放在没有评级的字段上?!

1 个答案:

答案 0 :(得分:1)

在另一列中以唯一的movieId形式使用Series.reindex,因为添加相同的顺序是Series.sort_values

movies_data = pd.read_csv('ml-latest-small/movies.csv')
ratings_data = pd.read_csv('ml-latest-small/ratings.csv')

mov = movies_data['movieId'].sort_values().drop_duplicates()  
df = ratings_data.groupby('movieId').rating.mean().reindex(mov).reset_index()
print (df)
      movieId    rating
0           1  3.920930
1           2  3.431818
2           3  3.259615
3           4  2.357143
4           5  3.071429
      ...       ...
9737   193581  4.000000
9738   193583  3.500000
9739   193585  3.500000
9740   193587  3.500000
9741   193609  4.000000

[9742 rows x 2 columns]

df1 = df[df['rating'].isna()]
print (df1)
      movieId  rating
816      1076     NaN
2211     2939     NaN
2499     3338     NaN
2587     3456     NaN
3118     4194     NaN
4037     5721     NaN
4506     6668     NaN
4598     6849     NaN
4704     7020     NaN
5020     7792     NaN
5293     8765     NaN
5421    25855     NaN
5452    26085     NaN
5749    30892     NaN
5824    32160     NaN
5837    32371     NaN
5957    34482     NaN
7565    85565     NaN

编辑:

如果需要向movie_data DataFrame添加新列,请在左联接中使用DataFrame.merge

movies_data = pd.read_csv('ml-latest-small/movies.csv')
ratings_data = pd.read_csv('ml-latest-small/ratings.csv')

df = ratings_data.groupby('movieId', as_index=False).rating.mean()
print (df)
      movieId    rating
0           1  3.920930
1           2  3.431818
2           3  3.259615
3           4  2.357143
4           5  3.071429
      ...       ...
9719   193581  4.000000
9720   193583  3.500000
9721   193585  3.500000
9722   193587  3.500000
9723   193609  4.000000

[9724 rows x 2 columns]

df = movies_data.merge(df, on='movieId', how='left')
print (df)
      movieId                                      title  \
0           1                           Toy Story (1995)   
1           2                             Jumanji (1995)   
2           3                    Grumpier Old Men (1995)   
3           4                   Waiting to Exhale (1995)   
4           5         Father of the Bride Part II (1995)   
      ...                                        ...   
9737   193581  Black Butler: Book of the Atlantic (2017)   
9738   193583               No Game No Life: Zero (2017)   
9739   193585                               Flint (2017)   
9740   193587        Bungo Stray Dogs: Dead Apple (2018)   
9741   193609        Andrew Dice Clay: Dice Rules (1991)   

                                           genres    rating  
0     Adventure|Animation|Children|Comedy|Fantasy  3.920930  
1                      Adventure|Children|Fantasy  3.431818  
2                                  Comedy|Romance  3.259615  
3                            Comedy|Drama|Romance  2.357143  
4                                          Comedy  3.071429  
                                          ...       ...  
9737              Action|Animation|Comedy|Fantasy  4.000000  
9738                     Animation|Comedy|Fantasy  3.500000  
9739                                        Drama  3.500000  
9740                             Action|Animation  3.500000  
9741                                       Comedy  4.000000  

[9742 rows x 4 columns]
相关问题