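I want to deduplicate my dataset so that, for each group of duplicates, only the row with the highest value is kept. Right now I am using pandas: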
c_maxes = hospProfiling.groupby(['Hospital_ID', 'District_ID'], group_keys=False)\
.apply(lambda x: x.loc[x['Hospital_employees'].idxmax()])
print(c_maxes)
c_maxes.to_csv('data/external/HospitalProfilingMaxes.csv')
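Doing this turns the initial dataset, with columns Hospital_ID, District_ID, Hospital_employees, into one with columns Hospital_ID, District_ID, Hospital_ID, District_ID, Hospital_employees.
The columns used for grouping are being duplicated. What is the mistake here?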
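Edit:
When I use groupby(), an extra column is added at the start of the data. The column has no name; it is just a sequence number for the rows. It is visible in the output shown in the second answer to this question. I want to remove this extra column because I don't need it. I tried this: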
hospProfiling.drop(hospProfiling.columns[0], axis=1)
This code does not remove the column. How do I remove it?
Answer 0 (score: 3)
Why not use the groupby max method?
hospProfiling.groupby(['Hospital_ID','District_ID'],as_index = False).max()
If you happen to have more than three columns, replace max with agg so that only Hospital_employees is aggregated:
hospProfiling.groupby(['Hospital_ID','District_ID'],as_index = False).agg({'Hospital_employees': 'max'})
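To illustrate what as_index=False changes, here is a minimal sketch on a made-up three-row frame (the data is an assumption for illustration, not taken from the question). By default the grouping keys move into the index of the result; with as_index=False they stay ordinary columns and the result gets a plain integer index.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    "Hospital_ID": ["A", "A", "B"],
    "District_ID": ["M", "M", "F"],
    "Hospital_employees": [25, 44, 28],
})
keys = ["Hospital_ID", "District_ID"]

# Default: the grouping keys become a MultiIndex on the result.
by_index = df.groupby(keys).agg({"Hospital_employees": "max"})

# as_index=False: the grouping keys stay as regular columns.
by_columns = df.groupby(keys, as_index=False).agg({"Hospital_employees": "max"})

print(list(by_index.columns))    # only the aggregated column
print(list(by_columns.columns))  # the two keys plus the aggregated column
```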
Answer 1 (score: 1)
I think you need:
hospProfiling.loc[hospProfiling.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees']
.idxmax()]
The other answer surprised me and made me wonder whether idxmax is unnecessary here, so I did some research.
Sample:
import pandas as pd
hospProfiling = pd.DataFrame({'Hospital_ID': {0: 'A', 1: 'A', 2: 'B', 3: 'A', 4: 'A', 5: 'B', 6: 'A', 7: 'A', 8: 'B', 9: 'B', 10: 'A', 11: 'B', 12: 'A'}, 'Name': {0: 'Sam', 1: 'Annie', 2: 'Fred', 3: 'Sam', 4: 'Annie', 5: 'Fred', 6: 'Sam', 7: 'Annie', 8: 'Fred', 9: 'James', 10: 'Alan', 11: 'Julie', 12: 'Greg'}, 'District_ID': {0: 'M', 1: 'F', 2: 'M', 3: 'M', 4: 'F', 5: 'M', 6: 'M', 7: 'F', 8: 'M', 9: 'M', 10: 'M', 11: 'F', 12: 'M'}, 'Hospital_employees': {0: 25, 1: 41, 2: 70, 3: 44, 4: 12, 5: 14, 6: 20, 7: 10, 8: 30, 9: 18, 10: 56, 11: 28, 12: 33}, 'Val': {0: 100, 1: 7, 2: 14, 3: 200, 4: 5, 5: 20, 6: 1, 7: 0, 8: 7, 9: 9, 10: 6, 11: 9, 12: 47}})
hospProfiling = hospProfiling[['Hospital_ID','District_ID','Hospital_employees','Val','Name']]
hospProfiling.sort_values(by=['Hospital_ID','District_ID'], inplace=True)
print (hospProfiling)
   Hospital_ID District_ID  Hospital_employees  Val   Name
1            A           F                  41    7  Annie
4            A           F                  12    5  Annie
7            A           F                  10    0  Annie
0            A           M                  25  100    Sam
3            A           M                  44  200    Sam
6            A           M                  20    1    Sam
10           A           M                  56    6   Alan
12           A           M                  33   47   Greg
11           B           F                  28    9  Julie
2            B           M                  70   14   Fred
5            B           M                  14   20   Fred
8            B           M                  30    7   Fred
9            B           M                  18    9  James
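The main difference is how the other columns are handled. If you use max, it returns the maximum of each column independently (here Hospital_employees and Val, and Name as well), so a result row can combine values that come from different original rows: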
c_maxes = hospProfiling.groupby(['Hospital_ID','District_ID'],as_index = False).max()
print (c_maxes)
  Hospital_ID District_ID  Hospital_employees   Name  Val
0           A           F                  41  Annie    7
1           A           M                  56    Sam  200
2           B           F                  28  Julie    9
3           B           M                  70  James   20
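The (A, M) row above is worth a closer look. As a small check, the sketch below rebuilds just that group from the sample and tests whether the combination returned by max exists in any original row. The variable names are my own; the data is copied from the sample.

```python
import pandas as pd

# The (A, M) group copied from the sample data.
group = pd.DataFrame({
    "Hospital_ID": ["A"] * 5,
    "District_ID": ["M"] * 5,
    "Hospital_employees": [25, 44, 20, 56, 33],
    "Val": [100, 200, 1, 6, 47],
    "Name": ["Sam", "Sam", "Sam", "Alan", "Greg"],
})

# max() is computed column by column within the group.
col_max = group.groupby(["Hospital_ID", "District_ID"], as_index=False).max()
combined = col_max.iloc[0]

# Look for an original row that holds all three maxima at once.
matches = group[
    (group["Hospital_employees"] == combined["Hospital_employees"])
    & (group["Val"] == combined["Val"])
    & (group["Name"] == combined["Name"])
]
print(len(matches))  # 0: no single original row carries all three values
```

The row with 56 employees belongs to Alan with Val 6, so max gives a summary of the group rather than one of its rows.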
c_maxes = (hospProfiling.groupby(['Hospital_ID','District_ID'],as_index = False)
           .agg({'Hospital_employees': 'max'}))
print (c_maxes)
  Hospital_ID District_ID  Hospital_employees
0           A           F                  41
1           A           M                  56
2           B           F                  28
3           B           M                  70
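The idxmax function instead returns, for each group, the index label of the row that holds the maximum of the Hospital_employees column: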
print (hospProfiling.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees'].idxmax())
A  F     1
   M    10
B  F    11
   M     2
Name: Hospital_employees, dtype: int64
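One detail that may matter for duplicated data: when several rows in a group share the maximum, idxmax returns the label of the first one, so exactly one row per group is selected. A minimal sketch with made-up data (the frame and its index labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with a tie for the maximum in group "A".
df = pd.DataFrame(
    {"Hospital_ID": ["A", "A", "B"], "Hospital_employees": [50, 50, 30]},
    index=[10, 20, 30],
)

# idxmax returns index labels, not positions; ties resolve to the first label.
idx = df.groupby("Hospital_ID")["Hospital_employees"].idxmax()
print(idx.tolist())

# One row per group, even though group "A" has two rows at the maximum.
print(df.loc[idx])
```

Because the result holds labels, the lookup that follows relies on loc and works most predictably when the index labels are unique.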
Then you only need loc to select those rows from the DataFrame:
c_maxes = hospProfiling.loc[hospProfiling.groupby(['Hospital_ID', 'District_ID'])['Hospital_employees']
.idxmax()]
print (c_maxes)
   District_ID Hospital_ID  Hospital_employees   Name  Val
1            F           A                  41  Annie    7
10           M           A                  56   Alan    6
11           F           B                  28  Julie    9
2            M           B                  70   Fred   14
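On the follow-up in the question: the unnamed leading column in this output (1, 10, 11, 2) looks like the DataFrame index rather than a real column, which would explain why dropping columns[0] does not remove it. Note also that drop returns a new frame instead of modifying the original. A minimal sketch, assuming the goal is a CSV without that column; the toy data is made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    "Hospital_ID": ["A", "A", "B"],
    "District_ID": ["M", "M", "F"],
    "Hospital_employees": [25, 44, 28],
})

c_maxes = df.loc[
    df.groupby(["Hospital_ID", "District_ID"])["Hospital_employees"].idxmax()
]

# The selected rows keep their original index labels, and to_csv
# writes them as an unnamed first column by default.
with_index = c_maxes.to_csv()

# index=False leaves the index out of the file.
without_index = c_maxes.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Alternatively, replace the labels with a fresh 0..n-1 index.
renumbered = c_maxes.reset_index(drop=True)
```

Passing a file path as the first argument of to_csv writes the same text to disk.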