I have a DataFrame that looks like this:
cluster_id,distance,url
0,0.1,abc.com
0,0.05,def.com
0,0.3,xyz.com
1,0.15,aaa.com
1,0.25,bbb.com
1,0.05,ccc.com
What I want to do is find the row with the minimum distance in each cluster and add a new column named centroid_url holding that row's url:
cluster_id,distance,url,centroid_url
0,0.1,abc.com,def.com
0,0.05,def.com,def.com
0,0.3,xyz.com,def.com
1,0.15,aaa.com,ccc.com
1,0.25,bbb.com,ccc.com
1,0.05,ccc.com,ccc.com
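I can think of some ugly ways to do this (for each possible cluster_id, find the minimum in a for loop), but I would like to know what a more elegant approach would be. Thanks.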
Answer 0: (score: 4)
Use sort_values with drop_duplicates, then map:
df1 = df.sort_values(['cluster_id','distance']).drop_duplicates('cluster_id')
print (df1)
   cluster_id  distance      url
1           0      0.05  def.com
5           1      0.05  ccc.com
df['centroid_url'] = df['cluster_id'].map(df1.set_index('cluster_id')['url'])
print (df)
   cluster_id  distance      url centroid_url
0           0      0.10  abc.com      def.com
1           0      0.05  def.com      def.com
2           0      0.30  xyz.com      def.com
3           1      0.15  aaa.com      ccc.com
4           1      0.25  bbb.com      ccc.com
5           1      0.05  ccc.com      ccc.com
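For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same sort_values / drop_duplicates / map steps. The DataFrame is rebuilt from the sample data in the question; the variable names closest and centroid_by_cluster are my own and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cluster_id": [0, 0, 0, 1, 1, 1],
    "distance": [0.1, 0.05, 0.3, 0.15, 0.25, 0.05],
    "url": ["abc.com", "def.com", "xyz.com", "aaa.com", "bbb.com", "ccc.com"],
})

# Sort so the smallest distance comes first within each cluster, then keep
# only the first row per cluster_id.
closest = df.sort_values(["cluster_id", "distance"]).drop_duplicates("cluster_id")

# Build a lookup Series: cluster_id -> url of the closest row.
centroid_by_cluster = closest.set_index("cluster_id")["url"]

# Map every row's cluster_id through the lookup.
df["centroid_url"] = df["cluster_id"].map(centroid_by_cluster)
print(df)
```

The intermediate lookup Series is handy if the per-cluster centroid urls are needed elsewhere, since it holds one entry per cluster.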
Answer 1: (score: 3)
IIUC:
In [29]: df['centroid_url'] = df.loc[
    ...:     df.groupby('cluster_id')['distance'].transform('idxmin'), 'url'
    ...: ].values
In [30]: df
Out[30]:
   cluster_id  distance      url centroid_url
0           0      0.10  abc.com      def.com
1           0      0.05  def.com      def.com
2           0      0.30  xyz.com      def.com
3           1      0.15  aaa.com      ccc.com
4           1      0.25  bbb.com      ccc.com
5           1      0.05  ccc.com      ccc.com
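The same idea as a standalone script, split into two steps so the intermediate result is visible. This is only a sketch under the assumption that the data matches the question's sample; min_idx is a name I introduced, and .to_numpy() is used here in place of .values as the equivalent accessor.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cluster_id": [0, 0, 0, 1, 1, 1],
    "distance": [0.1, 0.05, 0.3, 0.15, 0.25, 0.05],
    "url": ["abc.com", "def.com", "xyz.com", "aaa.com", "bbb.com", "ccc.com"],
})

# For every row, the index label of the minimum-distance row in its cluster.
min_idx = df.groupby("cluster_id")["distance"].transform("idxmin")

# Look up the url at those labels. The selection carries repeated index
# labels, so convert it to a plain array and assign by position instead of
# letting pandas align on the index.
df["centroid_url"] = df.loc[min_idx, "url"].to_numpy()
print(df)
```

Compared with the sort-based answer, this avoids sorting and building a separate lookup table, at the cost of relying on the index labels of df to identify rows.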