Given the problems with groupby() and nlargest(), as described here and here, I am trying to work around them.
Note: for simplicity I use nlargest(1), but it could be any number of selections.
{'city1': {0: 'Chicago',
1: 'Chicago',
2: 'Chicago',
3: 'Chicago',
4: 'Miami',
5: 'Houston',
6: 'Austin'},
'city2': {0: 'Toronto',
1: 'Detroit',
2: 'St.Louis',
3: 'Miami',
4: 'Dallas',
5: 'Dallas',
6: 'Dallas'},
'p234_r_c': {0: 5.0, 1: 4.0, 2: 2.0, 3: 0.5, 4: 1.0, 5: 4.0, 6: 3.0},
'plant1_type': {0: 'COMBCYCL',
1: 'COMBCYCL',
2: 'NUKE',
3: 'COAL',
4: 'NUKE',
5: 'COMBCYCL',
6: 'COAL'},
'plant2_type': {0: 'COAL',
1: 'COAL',
2: 'COMBCYCL',
3: 'COMBCYCL',
4: 'COAL',
5: 'NUKE',
6: 'NUKE',}}
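To reproduce the examples, the dictionary above can be passed directly to pd.DataFrame. The sketch below is a minimal equivalent that assumes the same values written as plain lists (the default 0 to 6 index matches the dictionary keys). The name data is an assumption; df follows the snippets in this question.

```python
import pandas as pd

# Same values as the dictionary above, written as lists for brevity.
data = {
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
}

# The default RangeIndex gives the row labels 0..6 used throughout.
df = pd.DataFrame(data)

# Group sizes show why the two cases differ: grouping on city1 yields one
# group with two rows, while grouping on city2 yields only single-row groups.
sizes_city1 = df.groupby(["city1", "plant1_type", "plant2_type"]).size()
sizes_city2 = df.groupby(["city2", "plant1_type", "plant2_type"]).size()
print(sizes_city1.max(), sizes_city2.max())
```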
A) groupby city1 and return the selected rows from the original df
cols2 = ['city1','plant1_type','plant2_type']
df.loc[df.groupby(cols2)['p234_r_c'].nlargest(1).reset_index().level_3]
     city1     city2  p234_r_c  plant1_type  plant2_type
6   Austin    Dallas       3.0         COAL         NUKE
3  Chicago     Miami       0.5         COAL     COMBCYCL
0  Chicago   Toronto       5.0     COMBCYCL         COAL
2  Chicago  St.Louis       2.0         NUKE     COMBCYCL
5  Houston    Dallas       4.0     COMBCYCL         NUKE
4    Miami    Dallas       1.0         NUKE         COAL
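The above looks fine.
B) groupby city2 and return the selected rows from the original df
Because the same code used in #A produces spurious results when grouping by city2, the following was suggested: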
cols = ['city2','plant1_type','plant2_type']
df.set_index(cols).groupby(level=cols)['p234_r_c'].nlargest(1)
city2     plant1_type  plant2_type
Toronto   COMBCYCL     COAL           5.0
Detroit   COMBCYCL     COAL           4.0
St.Louis  NUKE         COMBCYCL       2.0
Miami     COAL         COMBCYCL       0.5
Dallas    NUKE         COAL           1.0
          COMBCYCL     NUKE           4.0
          COAL         NUKE           3.0
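Now, how can I use this result to return the selected rows from the original df, as I did in #A?
Note: if the original df had one additional row, such that at least one of the groups produced by the groupby on city2 had a size greater than 1, then the groupby.nlargest() code from #A would also work for #B.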
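Answer 0 (score: 2)
Unless I'm missing something (and I agree there are bugs in the pandas code here), we can bypass any difficulties relatively simply.
Method #1: use loc and idxmax: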
In [21]: df.loc[df.groupby(cols2)["p234_r_c"].idxmax()]
Out[21]:
     city1     city2  p234_r_c  plant1_type  plant2_type
6   Austin    Dallas       3.0         COAL         NUKE
3  Chicago     Miami       0.5         COAL     COMBCYCL
0  Chicago   Toronto       5.0     COMBCYCL         COAL
2  Chicago  St.Louis       2.0         NUKE     COMBCYCL
5  Houston    Dallas       4.0     COMBCYCL         NUKE
4    Miami    Dallas       1.0         NUKE         COAL
In [22]: df.loc[df.groupby(cols)["p234_r_c"].idxmax()]
Out[22]:
     city1     city2  p234_r_c  plant1_type  plant2_type
6   Austin    Dallas       3.0         COAL         NUKE
5  Houston    Dallas       4.0     COMBCYCL         NUKE
4    Miami    Dallas       1.0         NUKE         COAL
1  Chicago   Detroit       4.0     COMBCYCL         COAL
3  Chicago     Miami       0.5         COAL     COMBCYCL
2  Chicago  St.Louis       2.0         NUKE     COMBCYCL
0  Chicago   Toronto       5.0     COMBCYCL         COAL
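As a self-contained sketch of Method #1, the snippet below wraps the loc and idxmax combination in a small helper. The helper name top_row_per_group is hypothetical and not part of pandas. Note that idxmax returns a single label per group (the first one on ties), so this only covers the nlargest(1) case.

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})


def top_row_per_group(frame, keys, value):
    """Hypothetical helper: the row holding the largest `value` in each group."""
    # idxmax yields one original index label per group ...
    winners = frame.groupby(keys)[value].idxmax()
    # ... which loc then uses to pull full rows from the original frame.
    return frame.loc[winners]


by_city1 = top_row_per_group(df, ["city1", "plant1_type", "plant2_type"], "p234_r_c")
by_city2 = top_row_per_group(df, ["city2", "plant1_type", "plant2_type"], "p234_r_c")
print(by_city1)
print(by_city2)
```

Because the rows come from loc on the original frame, the original index labels and column order are preserved in both cases.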
Method #2: sort by p234_r_c and use last:
In [17]: df.sort_values("p234_r_c").groupby(cols2, as_index=False).last()
Out[17]:
     city1  plant1_type  plant2_type     city2  p234_r_c
0   Austin         COAL         NUKE    Dallas       3.0
1  Chicago         COAL     COMBCYCL     Miami       0.5
2  Chicago     COMBCYCL         COAL   Toronto       5.0
3  Chicago         NUKE     COMBCYCL  St.Louis       2.0
4  Houston     COMBCYCL         NUKE    Dallas       4.0
5    Miami         NUKE         COAL    Dallas       1.0
In [18]: df.sort_values("p234_r_c").groupby(cols, as_index=False).last()
Out[18]:
      city2  plant1_type  plant2_type    city1  p234_r_c
0    Dallas         COAL         NUKE   Austin       3.0
1    Dallas     COMBCYCL         NUKE  Houston       4.0
2    Dallas         NUKE         COAL    Miami       1.0
3   Detroit     COMBCYCL         COAL  Chicago       4.0
4     Miami         COAL     COMBCYCL  Chicago       0.5
5  St.Louis         NUKE     COMBCYCL  Chicago       2.0
6   Toronto     COMBCYCL         COAL  Chicago       5.0
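Two details of Method #2 are worth keeping in mind. The result gets a fresh 0..n index with the grouping columns moved to the front, and GroupBy.last takes the last non-null value per column, so with missing data it may combine values from different rows. The sketch below reproduces Method #2 and, as an alternative that is not part of the answer above, uses drop_duplicates(keep="last") on the sorted frame to keep whole rows with their original index labels.

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
cols2 = ["city1", "plant1_type", "plant2_type"]

# Ascending sort puts each group's largest value last within that group.
ordered = df.sort_values("p234_r_c")

# Method #2 as in the answer: new 0..n index, grouping columns come first.
via_last = ordered.groupby(cols2, as_index=False).last()

# Alternative (an assumption, not from the answer): keep the last row seen
# for each key combination, preserving the original index labels.
via_dedup = ordered.drop_duplicates(subset=cols2, keep="last")

print(via_last)
print(via_dedup.sort_index())
```

On this data both variants select the same six rows; only the index and column order differ.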
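If you want to be able to get multiple results per group while nlargest and nsmallest are broken, I think the simplest approach is to sort and then use head or tail. For example: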
In [27]: df.sort_values("p234_r_c").groupby(cols, as_index=False).tail(2)
Out[27]:
     city1     city2  p234_r_c  plant1_type  plant2_type
3  Chicago     Miami       0.5         COAL     COMBCYCL
4    Miami    Dallas       1.0         NUKE         COAL
2  Chicago  St.Louis       2.0         NUKE     COMBCYCL
6   Austin    Dallas       3.0         COAL         NUKE
1  Chicago   Detroit       4.0     COMBCYCL         COAL
5  Houston    Dallas       4.0     COMBCYCL         NUKE
0  Chicago   Toronto       5.0     COMBCYCL         COAL
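The sort-then-tail idea generalizes to any n. The sketch below wraps it in a helper whose name, top_n_per_group, is hypothetical. It groups on city1 so that one group actually contains two rows, which makes the difference between n=1 and n=2 visible.

```python
import pandas as pd

df = pd.DataFrame({
    "city1": ["Chicago", "Chicago", "Chicago", "Chicago", "Miami", "Houston", "Austin"],
    "city2": ["Toronto", "Detroit", "St.Louis", "Miami", "Dallas", "Dallas", "Dallas"],
    "p234_r_c": [5.0, 4.0, 2.0, 0.5, 1.0, 4.0, 3.0],
    "plant1_type": ["COMBCYCL", "COMBCYCL", "NUKE", "COAL", "NUKE", "COMBCYCL", "COAL"],
    "plant2_type": ["COAL", "COAL", "COMBCYCL", "COMBCYCL", "COAL", "NUKE", "NUKE"],
})
cols2 = ["city1", "plant1_type", "plant2_type"]


def top_n_per_group(frame, keys, value, n):
    """Hypothetical helper: the n rows with the largest `value` in each group."""
    # After an ascending sort, the last n rows of each group are its n largest.
    # Groups with fewer than n rows are returned whole.
    return frame.sort_values(value).groupby(keys).tail(n)


top1 = top_n_per_group(df, cols2, "p234_r_c", 1)
top2 = top_n_per_group(df, cols2, "p234_r_c", 2)
print(top1.sort_index())
print(top2.sort_index())
```

Unlike last(), tail() returns whole rows with their original index labels, so the result can be compared directly with the original df.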