我有以下df概要:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 5 3
1 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 268 2
2 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 276 4
3 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 217 3
4 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 87 4
我正在寻找的是计算'用户ID'和平均'评级'并保持所有其他列完好无损。所以结果将是这样的:
movie id movie title release date IMDb URL genre user id rating
0 2 GoldenEye (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 50 3.75
1 3 Four Rooms (1995) 1-Jan-95 http://us.imdb.com/M/title-exact?GoldenEye%20(... Action|Adventure|Thriller 35 2.34
任何想法如何做到这一点?
由于
答案 0 :(得分:6)
如果所有值都在您聚合的列中,则对于每个组都是相同的,那么您可以通过将它们放入组中来避免连接。
然后将函数字典传递给agg
。如果您将as_index
设置为False
,则按列保持按列分组:
df.groupby(['movie id','movie title','release date','IMDb URL','genre'], as_index=False).agg({'user id':len,'rating':'mean'})
注意len
用于计算