I have this dataset:
individual cluster choice benchmark_probabilities
0 9710535 0 0 0.008647
1 9710535 2 0 0.012558
2 9710535 2 0 0.013894
3 9710535 1 0 0.030648
4 9710535 1 0 0.020298
5 9710535 1 0 0.021444
6 9710535 1 0 0.014804
7 9710535 5 0 0.163837
8 9710535 5 0 0.085191
9 9710535 2 0 0.013272
10 9710535 2 0 0.014684
11 9710535 2 0 0.006987
12 9710535 2 0 0.007387
13 9710535 2 0 0.008940
14 9710535 3 0 0.027746
15 9710535 3 0 0.017345
16 9710535 3 0 0.015545
17 9710535 4 0 0.007449
18 9710535 3 0 0.013382
19 9710535 4 0 0.011559
20 9710535 3 0 0.013091
21 9710535 4 0 0.006438
22 9710535 4 0 0.006089
23 9710535 4 0 0.007768
24 9710535 4 0 0.007348
25 9710535 2 0 0.001479
26 9710535 5 0 0.054764
27 9710535 5 0 0.065420
28 9710535 5 0 0.098600
29 9710535 5 0 0.067577
30 9710535 6 0 0.002158
31 9710535 6 0 0.002041
32 9710535 6 0 0.001694
33 9710535 6 0 0.001602
34 9710535 7 0 0.010075
35 9710535 7 0 0.008076
36 9710535 7 0 0.004485
37 9710535 7 0 0.009090
38 9710535 7 0 0.005834
39 9710535 5 0 0.018973
40 9710535 7 0 0.014945
41 9710535 7 0 0.007159
42 9710535 6 0 0.001624
43 9710535 6 0 0.001535
44 9710535 5 0 0.048068
45 9710535 7 0 0.003548
46 9710540 0 1 0.018614
47 9710540 0 0 0.006515
48 9710540 0 0 0.004040
49 9710540 1 0 0.005489
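What I want to do is the following:
1. Group by individual and cluster, then select the top 1 row of each group based on benchmark_probabilities.
2. Group that result by individual and select the top 5 rows.
3. If an individual has fewer than 5 unique cluster values, fill the remaining slots based on benchmark_probabilities, regardless of cluster.
The result should look like this: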
individual cluster choice benchmark_probabilities
0 9710535 1 0 0.030648
1 9710535 5 0 0.163837
2 9710535 3 0 0.027746
3 9710535 8 0 0.015682
4 9710535 11 1 0.050787
5 9710540 0 0 0.004040
6 9710540 1 0 0.005489
7 9710540 0 0 0.006515
8 9710540 0 1 0.018614
I have done the following, which covers the first and second steps but not the third:
data.groupby(["individual", "cluster"])["benchmark_probabilities"].nlargest(1).groupby("individual").nlargest(5)
But the result is not what I want, and it also looks ugly:
individual individual cluster
9710535 9710535 5 7 0.163837
11 75 0.050787
1 3 0.030648
3 14 0.027746
8 49 0.015682
9710540 9710540 0 98 0.018614
1 101 0.005489
Any help would be appreciated.
Answer 0 (score: 1)
I think you need DataFrame.sort_values and GroupBy.head instead of nlargest, both to avoid losing the choice column and for better performance:
df0 = (data.groupby(["individual", "cluster"])["benchmark_probabilities"].nlargest(1)
.groupby("individual").nlargest(5))
print (df0)
individual individual cluster
9710535 9710535 5 7 0.163837
1 3 0.030648
3 14 0.027746
7 40 0.014945
2 10 0.014684
9710540 9710540 0 46 0.018614
1 49 0.005489
Name: benchmark_probabilities, dtype: float64
df1 = (data.sort_values(['individual','cluster','benchmark_probabilities'],
ascending=[True, True, False])
.groupby(["individual", "cluster"]).head(1)
.sort_values(['individual','benchmark_probabilities'],
ascending=[True, False])
.groupby("individual").head(5))
print (df1)
individual cluster choice benchmark_probabilities
7 9710535 5 0 0.163837
3 9710535 1 0 0.030648
14 9710535 3 0 0.027746
40 9710535 7 0 0.014945
10 9710535 2 0 0.014684
46 9710540 0 1 0.018614
49 9710540 1 0 0.005489
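The difference between the two approaches can be checked on a small frame. This is only a sketch: the `toy` data is made up for illustration and is not part of the question's dataset.

```python
import pandas as pd

# Made-up data: individual 1 has clusters 0 and 1, individual 2 has cluster 0.
toy = pd.DataFrame(
    {
        "individual": [1, 1, 1, 2],
        "cluster": [0, 0, 1, 0],
        "choice": [0, 1, 0, 0],
        "benchmark_probabilities": [0.10, 0.30, 0.20, 0.40],
    }
)

# nlargest on a single selected column returns a Series,
# so the other columns (such as choice) are no longer available.
s = toy.groupby(["individual", "cluster"])["benchmark_probabilities"].nlargest(1)

# sort_values + head returns whole rows of the original DataFrame.
d = (
    toy.sort_values(
        ["individual", "cluster", "benchmark_probabilities"],
        ascending=[True, True, False],
    )
    .groupby(["individual", "cluster"])
    .head(1)
)

print(type(s).__name__)
print(list(d.columns))
```

Both select the same probabilities, but only `d` still carries the choice column and the original row index.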
Then keep only the rows of the original data that are not in df1, and sort them:
df2 = (data[~data.index.isin(df1.index)]
.sort_values(['individual','benchmark_probabilities'],
ascending=[True, False])
)
#print (df2)
Append them to df1 and take the top 5 rows per individual with head:
df = (pd.concat([df1, df2])
.groupby('individual').head(5)
.sort_values('individual'))
print (df)
individual cluster choice benchmark_probabilities
7 9710535 5 0 0.163837
3 9710535 1 0 0.030648
14 9710535 3 0 0.027746
40 9710535 7 0 0.014945
10 9710535 2 0 0.014684
46 9710540 0 1 0.018614
49 9710540 1 0 0.005489
47 9710540 0 0 0.006515
48 9710540 0 0 0.004040
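The three steps above can also be collected into one function. The following is a minimal sketch, not a definitive implementation: the name `top_n_with_fill`, the parameter `n`, and the `sample` data are hypothetical and introduced only for illustration. It assumes the index of `data` is unique, because the filler rows are found by index.

```python
import pandas as pd


def top_n_with_fill(data, n=5):
    """Hypothetical helper: up to n rows per individual, distinct clusters first."""
    # Step 1: best row of each (individual, cluster) pair.
    best_per_cluster = (
        data.sort_values(
            ["individual", "cluster", "benchmark_probabilities"],
            ascending=[True, True, False],
        )
        .groupby(["individual", "cluster"])
        .head(1)
    )
    # Step 2: keep the n strongest clusters per individual.
    df1 = (
        best_per_cluster.sort_values(
            ["individual", "benchmark_probabilities"], ascending=[True, False]
        )
        .groupby("individual")
        .head(n)
    )
    # Step 3: all remaining rows, strongest first, used only as fillers.
    df2 = data[~data.index.isin(df1.index)].sort_values(
        ["individual", "benchmark_probabilities"], ascending=[True, False]
    )
    # df1 rows come first in the concat, so head(n) only reaches the fillers
    # when an individual has fewer than n distinct clusters.
    return (
        pd.concat([df1, df2])
        .groupby("individual")
        .head(n)
        .sort_values("individual", kind="stable")
    )


# Made-up sample: individual 1 has three clusters, individual 2 has only one.
sample = pd.DataFrame(
    {
        "individual": [1, 1, 1, 1, 2, 2, 2],
        "cluster": [0, 0, 1, 2, 0, 0, 0],
        "choice": [0, 1, 0, 0, 0, 0, 1],
        "benchmark_probabilities": [0.10, 0.30, 0.20, 0.05, 0.40, 0.60, 0.50],
    }
)
result = top_n_with_fill(sample, n=2)
print(result)
```

With `n=2`, individual 1 gets its two strongest clusters, while individual 2 gets the best row of its single cluster plus one filler row from that same cluster. The `kind="stable"` argument is an addition to the answer's code so that the final sort keeps the within-individual order.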