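I am working with the MovieLens dataset, which consists of two files: one .csv file containing the movies, and another .csv file containing the ratings that n users gave to specific movies.
To get the average rating of each movie in the DataFrame, I did the following: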
ratings_data.groupby('movieId').rating.mean()
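With that code I get 9724 movies, whereas the main DataFrame contains 9742 movies.
I assume some movies have no ratings at all. Since I want to add the ratings to the main movies dataset, how can I put NaN in the fields that have no rating?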
Answer 0 (score: 1)
Use Series.reindex with the unique movieId values from the other DataFrame. Series.sort_values is added so that the result keeps the same sorted order:
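import pandas as pd

movies_data = pd.read_csv('ml-latest-small/movies.csv')
ratings_data = pd.read_csv('ml-latest-small/ratings.csv')
# Unique movieId values from the movies file, in sorted order
mov = movies_data['movieId'].sort_values().drop_duplicates()
# Mean rating per movie; reindex adds a NaN row for every movieId without ratings
df = ratings_data.groupby('movieId').rating.mean().reindex(mov).reset_index()
print(df)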
movieId rating
0 1 3.920930
1 2 3.431818
2 3 3.259615
3 4 2.357143
4 5 3.071429
... ...
9737 193581 4.000000
9738 193583 3.500000
9739 193585 3.500000
9740 193587 3.500000
9741 193609 4.000000
[9742 rows x 2 columns]
# Keep only the movies that have no rating
df1 = df[df['rating'].isna()]
print(df1)
movieId rating
816 1076 NaN
2211 2939 NaN
2499 3338 NaN
2587 3456 NaN
3118 4194 NaN
4037 5721 NaN
4506 6668 NaN
4598 6849 NaN
4704 7020 NaN
5020 7792 NaN
5293 8765 NaN
5421 25855 NaN
5452 26085 NaN
5749 30892 NaN
5824 32160 NaN
5837 32371 NaN
5957 34482 NaN
7565 85565 NaN
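To see the reindex step in isolation, here is a minimal sketch on made-up data. The `movies` and `ratings` frames below are hypothetical stand-ins for movies.csv and ratings.csv, not the real MovieLens files; movie 2 is deliberately left without ratings.

```python
import pandas as pd

# Hypothetical toy data standing in for movies.csv and ratings.csv
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movieId": [1, 1, 3], "rating": [4.0, 5.0, 3.0]})

# Mean rating per movie: only movies 1 and 3 appear here
mean_rating = ratings.groupby("movieId")["rating"].mean()

# Reindex by every movieId from the movies table; missing ids become NaN
all_ids = movies["movieId"].sort_values().drop_duplicates()
result = mean_rating.reindex(all_ids).reset_index()
print(result)
```

The result should have one row per movie, with NaN in the rating column for movie 2.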
Edit:
If you need to add a new column to the movies_data DataFrame, use DataFrame.merge with a left join:
movies_data = pd.read_csv('ml-latest-small/movies.csv')
ratings_data = pd.read_csv('ml-latest-small/ratings.csv')
# as_index=False keeps movieId as a regular column so it can be used as the merge key
df = ratings_data.groupby('movieId', as_index=False).rating.mean()
print(df)
movieId rating
0 1 3.920930
1 2 3.431818
2 3 3.259615
3 4 2.357143
4 5 3.071429
... ...
9719 193581 4.000000
9720 193583 3.500000
9721 193585 3.500000
9722 193587 3.500000
9723 193609 4.000000
[9724 rows x 2 columns]
# The left join keeps every row of movies_data; movies without ratings get NaN
df = movies_data.merge(df, on='movieId', how='left')
print(df)
movieId title \
0 1 Toy Story (1995)
1 2 Jumanji (1995)
2 3 Grumpier Old Men (1995)
3 4 Waiting to Exhale (1995)
4 5 Father of the Bride Part II (1995)
... ...
9737 193581 Black Butler: Book of the Atlantic (2017)
9738 193583 No Game No Life: Zero (2017)
9739 193585 Flint (2017)
9740 193587 Bungo Stray Dogs: Dead Apple (2018)
9741 193609 Andrew Dice Clay: Dice Rules (1991)
genres rating
0 Adventure|Animation|Children|Comedy|Fantasy 3.920930
1 Adventure|Children|Fantasy 3.431818
2 Comedy|Romance 3.259615
3 Comedy|Drama|Romance 2.357143
4 Comedy 3.071429
... ...
9737 Action|Animation|Comedy|Fantasy 4.000000
9738 Animation|Comedy|Fantasy 3.500000
9739 Drama 3.500000
9740 Action|Animation 3.500000
9741 Comedy 4.000000
[9742 rows x 4 columns]
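As an alternative to the merge shown in the answer (this is a different technique, not what the answer uses), the mean ratings can be looked up per movieId with Series.map. This is a minimal sketch on hypothetical toy data; the `movies` and `ratings` frames are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for movies.csv and ratings.csv
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movieId": [1, 1, 3], "rating": [4.0, 5.0, 3.0]})

# Series indexed by movieId holding the mean rating
mean_rating = ratings.groupby("movieId")["rating"].mean()

# Look up each movieId in mean_rating; ids that are absent map to NaN
movies["rating"] = movies["movieId"].map(mean_rating)
print(movies)
```

Like the left join, this keeps every row of the movies table. It only adds a single column, so merge remains the more general choice when several aggregated columns have to be attached at once.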