Alabama 1 Byrne,Bradley 68.16 68.16 0.0 LeFlore,Burton R. 31.71 31.71 0.0 Unknown 0.13 0.13 0.0
I have a dataset that looks like this:
STATE | DISTRICT | CANDIDATE NAME | GENERAL VOTE
Alabama | 1 | Byrne, Bradley | 68.16
Alabama | 1 | LeFlore, Burton R. | 31.71
Alabama | 1 | Unknown | 0.13
Alabama | 2 | Name | 65.43
Alabama | 2 | Name | 0.13
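I have to group by state and district, because each state has multiple districts and there are many states. I have already done that. However, I need to find the maximum value in each group and display the candidate name that corresponds to that maximum. I also have to show the difference between the highest and lowest general vote in each group. I have done part of it, but I am stuck: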
df_out = dfworking.groupby(["STATE", "D", "CANDIDATE NAME"])['GENERAL PERCENT'].agg(['max','min'])
df_out['Margin'] = df_out['max']-df_out['min']
df_new_out = dfworking.groupby(['STATE','D'])['GENERAL PERCENT'].max()
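I am not sure how to show the margin column, together with the name that corresponds to the maximum vote, in the same DataFrame. Thanks!
Answer 0 (score: 2)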
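Note: the values must first be sorted by the STATE, DISTRICT and GENERAL VOTE columns (GENERAL VOTE in descending order), so that the row with the highest vote comes first in each group and forward filling assigns the correct name.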
#sorting
dfworking = dfworking.sort_values(['STATE','DISTRICT','GENERAL VOTE'],
ascending=[True, True, False])
#get index of max value in GENERAL VOTE column
df1 = dfworking.groupby(["STATE", "DISTRICT"])['GENERAL VOTE'].idxmax()
#create new column - not matched value return NaN
dfworking['cand'] = dfworking.loc[df1, 'CANDIDATE NAME']
#replace NaN by forward filling
dfworking['cand'] = dfworking['cand'].ffill()
print (dfworking)
STATE DISTRICT CANDIDATE NAME GENERAL VOTE cand
0 Alabama 1 Byrne, Bradley 68.16 Byrne, Bradley
1 Alabama 1 LeFlore, Burton R. 31.71 Byrne, Bradley
2 Alabama 1 Unknown 0.13 Byrne, Bradley
3 Alabama 2 Name 65.43 Name
4 Alabama 2 Name 0.13 Name
Another solution is to create a df with the top candidate of each group and join it with the original:
df1 = dfworking.loc[dfworking.groupby(["STATE", "DISTRICT"])['GENERAL VOTE'].idxmax()]
df1 = df1.set_index(['STATE','DISTRICT'])['CANDIDATE NAME'].rename('cand')
dfworking = dfworking.join(df1, on=['STATE','DISTRICT'])
print (dfworking)
STATE DISTRICT CANDIDATE NAME GENERAL VOTE cand
0 Alabama 1 Byrne, Bradley 68.16 Byrne, Bradley
1 Alabama 1 LeFlore, Burton R. 31.71 Byrne, Bradley
2 Alabama 1 Unknown 0.13 Byrne, Bradley
3 Alabama 2 Name 65.43 Name
4 Alabama 2 Name 0.13 Name
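The question also asks for the margin between the highest and lowest vote in each group, which the code above does not cover. One possible way to add it, sketched here under the assumption that the columns are named as in the sample table (the `summary` and `winners` names are only illustrative), is to use `transform` for a per-row column and `agg` for a one-row-per-district view:

```python
import pandas as pd

# Sample data copied from the question
dfworking = pd.DataFrame({
    "STATE": ["Alabama"] * 5,
    "DISTRICT": [1, 1, 1, 2, 2],
    "CANDIDATE NAME": ["Byrne, Bradley", "LeFlore, Burton R.",
                       "Unknown", "Name", "Name"],
    "GENERAL VOTE": [68.16, 31.71, 0.13, 65.43, 0.13],
})

grouped = dfworking.groupby(["STATE", "DISTRICT"])["GENERAL VOTE"]

# Per-row margin: group max minus group min, broadcast back to every row
dfworking["Margin"] = grouped.transform("max") - grouped.transform("min")

# One row per district: max, min and margin (illustrative name: summary)
summary = grouped.agg(["max", "min"])
summary["Margin"] = summary["max"] - summary["min"]

# Name of the candidate holding the maximum, aligned on (STATE, DISTRICT)
winners = (dfworking.loc[grouped.idxmax()]
           .set_index(["STATE", "DISTRICT"])["CANDIDATE NAME"])
summary["cand"] = winners

print(dfworking)
print(summary)
```

This variant does not depend on the rows being sorted, because `idxmax` and `transform` work per group regardless of row order. If two candidates tie for the maximum, `idxmax` returns only the first one it encounters.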