Find the minimum value of column X in a pandas DataFrame, grouped by column Y

Time: 2017-12-11 19:44:55

Tags: python pandas

I have a DataFrame that looks like this:

cluster_id,distance,url
0,0.1,abc.com
0,0.05,def.com
0,0.3,xyz.com
1,0.15,aaa.com
1,0.25,bbb.com
1,0.05,ccc.com

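What I want to do is find the row with the minimum distance in each cluster and add a new column named centroid_url that holds that row's url for every row of the cluster: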

cluster_id,distance,url,centroid_url
0,0.1,abc.com,def.com
0,0.05,def.com,def.com
0,0.3,xyz.com,def.com
1,0.15,aaa.com,ccc.com
1,0.25,bbb.com,ccc.com
1,0.05,ccc.com,ccc.com

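I can think of some ugly ways to do this (for each possible cluster_id, find the minimum in a for loop), but I would like to know what a more elegant approach is. Thanks.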

2 answers:

Answer 0: (score: 4)

Use sort_values together with drop_duplicates to keep the closest row of each cluster, then map the result back onto cluster_id:

df1 = df.sort_values(['cluster_id','distance']).drop_duplicates('cluster_id')
print(df1)
   cluster_id  distance      url
1           0      0.05  def.com
5           1      0.05  ccc.com

df['centroid_url'] = df['cluster_id'].map(df1.set_index('cluster_id')['url'])
print(df)
   cluster_id  distance      url centroid_url
0           0      0.10  abc.com      def.com
1           0      0.05  def.com      def.com
2           0      0.30  xyz.com      def.com
3           1      0.15  aaa.com      ccc.com
4           1      0.25  bbb.com      ccc.com
5           1      0.05  ccc.com      ccc.com
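
For anyone who wants to run this answer end to end, here is a minimal self-contained sketch. It rebuilds the sample data from the question by hand; the helper name add_centroid_url and the variable names closest and lookup are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cluster_id": [0, 0, 0, 1, 1, 1],
    "distance": [0.1, 0.05, 0.3, 0.15, 0.25, 0.05],
    "url": ["abc.com", "def.com", "xyz.com", "aaa.com", "bbb.com", "ccc.com"],
})


def add_centroid_url(frame):
    """Return a copy of frame with a centroid_url column (hypothetical helper)."""
    # Sort so that the smallest distance comes first within each cluster,
    # then keep only the first row per cluster_id.
    closest = (
        frame.sort_values(["cluster_id", "distance"])
             .drop_duplicates("cluster_id")
    )
    # Build a cluster_id -> url lookup and map it onto every row.
    lookup = closest.set_index("cluster_id")["url"]
    out = frame.copy()
    out["centroid_url"] = out["cluster_id"].map(lookup)
    return out


result = add_centroid_url(df)
print(result)
```

drop_duplicates keeps the first row per cluster_id by default, which after the sort is a row with the smallest distance in that cluster.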

Answer 1: (score: 3)

IIUC:

In [29]: df['centroid_url'] = df.loc[df.groupby('cluster_id')['distance']
                                       .transform('idxmin'), 'url'] \
                                .values

In [30]: df
Out[30]:
   cluster_id  distance      url centroid_url
0           0      0.10  abc.com      def.com
1           0      0.05  def.com      def.com
2           0      0.30  xyz.com      def.com
3           1      0.15  aaa.com      ccc.com
4           1      0.25  bbb.com      ccc.com
5           1      0.05  ccc.com      ccc.com
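
The one-liner above can be unpacked into two steps, which may make it easier to see why the trailing .values is there. This is only a sketch on the sample data from the question; the intermediate name min_idx is illustrative, and .to_numpy() is used in place of .values, which should behave the same way here.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "cluster_id": [0, 0, 0, 1, 1, 1],
    "distance": [0.1, 0.05, 0.3, 0.15, 0.25, 0.05],
    "url": ["abc.com", "def.com", "xyz.com", "aaa.com", "bbb.com", "ccc.com"],
})

# Step 1: for every row, the index label of the minimum-distance row
# in that row's cluster (one label per original row).
min_idx = df.groupby("cluster_id")["distance"].transform("idxmin")

# Step 2: look up the url at those labels. The lookup result is indexed by
# the looked-up labels rather than by the original rows, so it is converted
# to a plain array and assigned by position.
df["centroid_url"] = df.loc[min_idx, "url"].to_numpy()

print(df)
```

If several rows in a cluster share the minimum distance, idxmin returns the label of the first one.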