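I have a set of data, and I want to get the highest score for each "topic" within each "path". Basically, I am trying to get the top score for students.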
import pandas as pd

l1 = ['/some/path/A', '/some/path/A', '/some/path/A', '/some/path/2', '/some/path/2']
l2 = ['paper1', 'paper1', 'paper1', 'paper2', 'paper2']
l3 = ['Tree', 'Tree', 'Axe', 'Leaf', 'Leaf']
l4 = [.68, .90, .80, .65, .30]
df = pd.DataFrame({
    'path': l1,
    'paper_name': l2,
    'topic': l3,
    'score': l4,
})
df
path          paper_name  topic  score
/some/path/A  paper1      Tree   0.68
/some/path/A  paper1      Tree   0.90
/some/path/A  paper1      Axe    0.80
/some/path/2  paper2      Leaf   0.65
/some/path/2  paper2      Leaf   0.30
I have tried this logic:
df[df['topic']=='Tree' & df['score'].max()]
This is hard-coded; I need something simpler that does not hard-code the topic the way I am doing here. It also errors out because of data types. How do I get the following result?
path          paper_name  topic  score
/some/path/A  paper1      Tree   0.90
/some/path/A  paper1      Axe    0.80
/some/path/2  paper2      Leaf   0.65
Answer 0 (score: 1)
You can use sort_values and drop_duplicates:
df_final = df.sort_values('score').drop_duplicates('topic', keep='last').sort_index()
Out[76]:
  paper_name          path  score topic
1     paper1  /some/path/A   0.90  Tree
2     paper1  /some/path/A   0.80   Axe
3     paper2  /some/path/2   0.65  Leaf
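The question asks for the top score per topic within each path, while the line above deduplicates on topic alone. In the sample data every topic appears under a single path, so the two give the same rows. A minimal sketch, assuming the same topic could appear under more than one path, passes both columns as the subset (the name top_per_path_topic is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'path': ['/some/path/A', '/some/path/A', '/some/path/A',
             '/some/path/2', '/some/path/2'],
    'paper_name': ['paper1', 'paper1', 'paper1', 'paper2', 'paper2'],
    'topic': ['Tree', 'Tree', 'Axe', 'Leaf', 'Leaf'],
    'score': [.68, .90, .80, .65, .30],
})

# Sort ascending by score, keep the last (highest-scoring) row of each
# (path, topic) pair, then restore the original row order.
top_per_path_topic = (
    df.sort_values('score')
      .drop_duplicates(subset=['path', 'topic'], keep='last')
      .sort_index()
)
print(top_per_path_topic)
```

If two rows in a pair tie for the highest score, only one of them is kept.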
Or use groupby.idxmax and slice with .loc:
df_final = df.loc[df.groupby('topic', sort=False).score.idxmax()]
Out[82]:
  paper_name          path  score topic
1     paper1  /some/path/A   0.90  Tree
2     paper1  /some/path/A   0.80   Axe
3     paper2  /some/path/2   0.65  Leaf
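As for why the attempt in the question errors out: in Python, & binds more tightly than ==, so 'Tree' & df['score'].max() is evaluated first and raises a TypeError (a string combined with a float). In addition, df['score'].max() is a single number, not a per-row condition. A sketch that stays close to that boolean-mask style, assuming the grouping should be on both path and topic and that tied rows should all be kept, compares each score with its group's maximum via transform (group_max and result are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'path': ['/some/path/A', '/some/path/A', '/some/path/A',
             '/some/path/2', '/some/path/2'],
    'paper_name': ['paper1', 'paper1', 'paper1', 'paper2', 'paper2'],
    'topic': ['Tree', 'Tree', 'Axe', 'Leaf', 'Leaf'],
    'score': [.68, .90, .80, .65, .30],
})

# transform('max') returns a Series aligned with df, holding the maximum
# score of each row's (path, topic) group.
group_max = df.groupby(['path', 'topic'])['score'].transform('max')

# Keep the rows whose score equals their group's maximum.
result = df[df['score'] == group_max]
print(result)
```

Unlike idxmax, which returns one index label per group, this mask keeps every row tied for a group's maximum.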