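I have a DataFrame with 5 columns. I am looking for the top 5 schools liked by the most users.
I got the top 5 schools by number of likes, but I am struggling to restrict the count to unique users. I added user_id.nunique() but got an error.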
df.groupby('school_name')['like_id'].count().nlargest(5)
Sample data
school_name  Day  user_id  like_id  location_id
Tilden HS    Mon  1        1        10
South Shore  Tue  2        2        11
Tilden HS    Mon  1        3        12
South Shore  Wed  3        4        13
Brooklyn     Wed  5        5        14
Canarsie     Thu  7        6        15
Erasmus      Fri  8        7        16
Erasmus      Sat  8        8        17
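For anyone who wants to reproduce this, here is a minimal sketch that builds the sample table as a DataFrame and runs the likes-per-school count from the question. The variable name `likes_per_school` is only illustrative, and the integer dtypes for the id columns are an assumption.

```python
import pandas as pd

# Sample data copied from the table above (id columns assumed to be integers).
df = pd.DataFrame({
    "school_name": ["Tilden HS", "South Shore", "Tilden HS", "South Shore",
                    "Brooklyn", "Canarsie", "Erasmus", "Erasmus"],
    "Day": ["Mon", "Tue", "Mon", "Wed", "Wed", "Thu", "Fri", "Sat"],
    "user_id": [1, 2, 1, 3, 5, 7, 8, 8],
    "like_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "location_id": [10, 11, 12, 13, 14, 15, 16, 17],
})

# Total likes per school; repeat likes from the same user are all counted.
likes_per_school = df.groupby("school_name")["like_id"].count().nlargest(5)
print(likes_per_school)
```

Note that this counts every like, so a user who likes the same school twice is counted twice.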
Answer 0 (score: 2)
I believe you need SeriesGroupBy.nunique:
s = df.groupby('school_name')['user_id'].nunique().nlargest(5)
print (s)
school_name
South Shore    2
Brooklyn       1
Canarsie       1
Erasmus        1
Tilden HS      1
Name: user_id, dtype: int64
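To see how counting likes differs from counting unique users, the two can be put side by side. This is only a sketch using named aggregation; the column names `total_likes` and `unique_users` are made up for illustration, and the data is the sample from the question.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "school_name": ["Tilden HS", "South Shore", "Tilden HS", "South Shore",
                    "Brooklyn", "Canarsie", "Erasmus", "Erasmus"],
    "user_id": [1, 2, 1, 3, 5, 7, 8, 8],
    "like_id": [1, 2, 3, 4, 5, 6, 7, 8],
})

# One row per school: every like counted vs. distinct users only.
summary = df.groupby("school_name").agg(
    total_likes=("like_id", "count"),
    unique_users=("user_id", "nunique"),
)

# Rank schools by the number of distinct users who liked them.
top5 = summary.sort_values("unique_users", ascending=False, kind="stable").head(5)
print(top5)
```

Erasmus and Tilden HS each have two likes, but both likes come from a single user, so they rank below South Shore once only unique users are counted.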
Or, if you need to group by a combination of columns and count the unique values in the third column:
s1 = df.groupby(['school_name', 'user_id'])['like_id'].nunique().groupby(level=0).sum().nlargest(5)
print (s1)
school_name
Erasmus        2
South Shore    2
Tilden HS      2
Brooklyn       1
Canarsie       1
Name: like_id, dtype: int64
s2 = df.groupby(['school_name', 'like_id'])['user_id'].nunique().groupby(level=0).sum().nlargest(5)
print (s2)
school_name
Erasmus        2
South Shore    2
Tilden HS      2
Brooklyn       1
Canarsie       1
Name: user_id, dtype: int64
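Both variants above end up counting the distinct (user_id, like_id) pairs per school. As a sketch of another way to get the same numbers, assuming none of the id columns contain NaN (nunique skips NaN while drop_duplicates keeps it), you could drop duplicate rows first and then take the group sizes. The names `pairs` and `pair_counts` are only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "school_name": ["Tilden HS", "South Shore", "Tilden HS", "South Shore",
                    "Brooklyn", "Canarsie", "Erasmus", "Erasmus"],
    "user_id": [1, 2, 1, 3, 5, 7, 8, 8],
    "like_id": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Keep one row per distinct (school, user, like) combination.
pairs = df.drop_duplicates(["school_name", "user_id", "like_id"])

# Number of distinct (user, like) pairs per school, top 5.
pair_counts = pairs.groupby("school_name").size().nlargest(5)
print(pair_counts)

# Two-level nunique followed by a per-school sum, for comparison.
s1 = (df.groupby(["school_name", "user_id"])["like_id"]
        .nunique()
        .groupby(level=0)
        .sum()
        .nlargest(5))
print(s1.to_dict() == pair_counts.to_dict())
```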
Answer 1 (score: 1)
First, we can pivot:
df_pivot = df.pivot_table(index='school_name',
                          columns='user_id',
                          values='like_id',
                          aggfunc='count',
                          fill_value=0)
This gives df_pivot:
user_id      1  2  3  5  7  8
school_name
Brooklyn     0  0  0  1  0  0
Canarsie     0  0  0  0  1  0
Erasmus      0  0  0  0  0  2
South Shore  0  1  1  0  0  0
Tilden HS    2  0  0  0  0  0
Then, to get the top schools by number of unique users:
df_pivot.ne(0).sum(1).nlargest(5)
Gives:
school_name
South Shore    2
Brooklyn       1
Canarsie       1
Erasmus        1
Tilden HS      1
dtype: int64
Or by like_id:
df_pivot.sum(1).nlargest(5)
Gives:
school_name
Erasmus        2
South Shore    2
Tilden HS      2
Brooklyn       1
Canarsie       1
dtype: int64
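As a cross-check, the same school-by-user count table can probably be built with pd.crosstab instead of pivot_table, and the row-wise results should agree with a plain groupby. This is a sketch on the sample data, not a claim about which approach is faster; `ct`, `unique_users` and `total_likes` are illustrative names.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "school_name": ["Tilden HS", "South Shore", "Tilden HS", "South Shore",
                    "Brooklyn", "Canarsie", "Erasmus", "Erasmus"],
    "user_id": [1, 2, 1, 3, 5, 7, 8, 8],
    "like_id": [1, 2, 3, 4, 5, 6, 7, 8],
})

# Rows are schools, columns are users, cells are the number of likes.
ct = pd.crosstab(df["school_name"], df["user_id"])

# Non-zero cells per row = number of distinct users per school.
unique_users = ct.ne(0).sum(axis=1)

# Row totals = number of likes per school.
total_likes = ct.sum(axis=1)

# Compare against the direct groupby results.
by_groupby = df.groupby("school_name")["user_id"].nunique()
print(unique_users.to_dict() == by_groupby.to_dict())
print(unique_users.nlargest(5))
print(total_likes.nlargest(5))
```

The wide table grows with the number of distinct users, so on large data the groupby version may be the more economical choice.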