I have two data frames: one lists the songs each customer likes, and the other maps each user to a cluster.
DATA 1:
user cluster
A 1
B 2
C 1
D 2
E 1
DATA 2:
cluster songs
1 11, 22, 33
2 22, 33, 44
I have obtained all the songs that each cluster listens to, as shown below.
user song
A [33]
B [44]
C [11,22]
D [22]
E [22,33]
I want the output to list the songs that each user in a given cluster has not listened to.
Expected output:
Answer 0 (score: 2)
Use:
In [861]: df1.groupby(df1.user.map(df2.set_index('user')['cluster']))['song'].unique()
Out[861]:
user
1 [11, 22, 33]
2 [22, 33]
Name: song, dtype: object
Or, to get comma-separated strings instead of arrays:
In [857]: df1.groupby(df1.user.map(df2.set_index('user')['cluster']))['song'].agg(
lambda x: ', '.join(x.unique().astype(str)))
Out[857]:
user
1 11, 22, 33
2 22, 33
Name: song, dtype: object
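The lookup-and-group idea above can be tried end to end with a minimal sketch. The sample frames below are an assumption (the question's original data is not shown in full), chosen so that the result matches the output printed above; the names user_to_cluster and cluster_key are only illustrative.

```python
import pandas as pd

# Assumed sample data, made up for illustration only
df1 = pd.DataFrame({"user": ["A", "A", "B", "B", "C"],
                    "song": [11, 22, 22, 33, 33]})
df2 = pd.DataFrame({"user": ["A", "B", "C"],
                    "cluster": [1, 2, 1]})

# Series that maps each user to its cluster
user_to_cluster = df2.set_index("user")["cluster"]

# Cluster of every row in df1, aligned with df1's index
cluster_key = df1["user"].map(user_to_cluster)

# Unique songs heard within each cluster
songs_per_cluster = df1.groupby(cluster_key)["song"].unique()
print(songs_per_cluster)
```

Grouping by a mapped Series avoids building a merged frame, which may be convenient when only the grouped result is needed.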
Answer 1 (score: 2)
Use merge with a left join:
import pandas as pd

# left-join clusters onto songs, then keep one row per (cluster, song) pair
df = pd.merge(df1, df2, on='user', how='left').drop_duplicates(['cluster','song'])
print (df)
user song cluster
0 A 11 1
1 A 22 1
2 B 22 2
3 B 33 2
5 C 33 1
Then aggregate with join, but the song values must be converted to strings first:
df = df['song'].astype(str).groupby(df['cluster']).apply(', '.join).reset_index()
print (df)
cluster song
0 1 11, 22, 33
1 2 22, 33
Or, if you need lists:
df = df.groupby('cluster')['song'].apply(list).reset_index()
#same as
#df = df['song'].groupby(df['cluster']).apply(list).reset_index()
print (df)
cluster song
0 1 [11, 22, 33]
1 2 [22, 33]
Edit:
df = pd.merge(df1, df2, on='user', how='left').drop_duplicates(['user','song'])
# keyword arguments are required by pivot in recent pandas versions
df1 = df.pivot(index='user', columns='song', values='cluster')
df3 = df1.isnull().stack().reset_index(name='val')
df3 = df3[df3['val']].groupby('user')['song'].apply(list).reindex(df2['user'])
print (df3)
user
A [33]
B [11]
C [22]
D [11]
E [22, 33]
Name: song, dtype: object
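An alternative way to get the songs a user has not heard is a plain set difference between the cluster's songs and the user's own songs. This is only a sketch under assumptions: the sample frames are made up, and the names merged, cluster_songs, user_songs and unheard are illustrative, not part of the answer above.

```python
import pandas as pd

# Assumed sample data, made up for illustration only
df1 = pd.DataFrame({"user": ["A", "A", "B", "B", "C"],
                    "song": [11, 22, 22, 33, 33]})
df2 = pd.DataFrame({"user": ["A", "B", "C"],
                    "cluster": [1, 2, 1]})

# Attach the cluster to every (user, song) row
merged = df1.merge(df2, on="user", how="left")

# Songs heard by anyone in each cluster, and songs heard by each user
cluster_songs = merged.groupby("cluster")["song"].apply(set)
user_songs = merged.groupby("user")["song"].apply(set)

# For each user: songs of the user's cluster minus the user's own songs
unheard = {}
for user, cluster in df2.set_index("user")["cluster"].items():
    in_cluster = cluster_songs.get(cluster, set())
    heard = user_songs.get(user, set())
    unheard[user] = sorted(int(s) for s in in_cluster - heard)

print(unheard)
```

With this sample, user A is left with song 33, user C with songs 11 and 22, and user B with nothing, because B has already heard every song in cluster 2.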
Answer 2 (score: 1)
Use map + groupby + unique
This is a very efficient solution:
mapper = df1.user.map(df2.set_index('user').cluster)
df1.song.groupby(mapper).unique()
user
1 [11, 22, 33]
2 [22, 33]
Name: song, dtype: object
This returns the list of unique song values for each cluster.
Answer 3 (score: 0)
In one line:
data1.merge(data2, left_on='user', right_on='user').groupby('cluster').song.unique()
Output:
cluster
1 [11, 22, 33]
2 [22, 33]
Name: song, dtype: object
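For anyone who wants to run the merge-based one-liner directly, here is a minimal sketch. The frames data1 and data2 are filled with assumed sample values (not the asker's real data), and on='user' is used as shorthand because both frames share that column name.

```python
import pandas as pd

# Assumed sample data, made up for illustration only
data1 = pd.DataFrame({"user": ["A", "A", "B", "B", "C"],
                      "song": [11, 22, 22, 33, 33]})
data2 = pd.DataFrame({"user": ["A", "B", "C"],
                      "cluster": [1, 2, 1]})

# Join on the shared 'user' column, then collect unique songs per cluster
result = data1.merge(data2, on="user").groupby("cluster")["song"].unique()
print(result)
```

Note that merge defaults to an inner join, so users missing from either frame are dropped from the result.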