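I have an Excel spreadsheet like the one below (note: the ID column, column A, contains duplicate values). I want to find the total Unit_Weight for each Contract_type, counting each ID only once (unique).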
import numpy as np
import pandas as pd

data = {'ID': ["380689","380689","480562","480562","480562","14805","47089","56251","56251","56251","322624","322624","322624","85964","85964","85964","342225","342225","4589","23591","23591","235225"],
'Contract_type' : ["Other","Other","Type-I","Type-I","Type-I","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-III","Type-III","Type-III","Part-time","Part-time","Part-time","Full-time","Full-time","Full-time"],
'Unit_Weight': [335,335,119,119,119,119,52,452,452,452,19,19,19,165,165,165,165,165,165,724,724,16],
'Test_time' : ["16:26","07:39","18:48","22:32","03:54","03:30","09:57","18:52","19:03","18:06","18:52","03:51","04:00","22:02","13:35","13:43","10:29","06:30","12:20","12:52","17:30","13:10"],
'Tested' : [1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0],
'Internal' : [1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data)
I tried:
print(pd.pivot_table(df, index=["Contract_type", "ID"]).Unit_Weight)
It gives:
Contract_type  ID
Full-time      23591     724
               235225     16
Other          380689    335
....
But I only want it to show something like: Full-time 740, and so on.
I also tried:
print(pd.pivot_table(df, index=["Contract_type"], values=["Unit_Weight"], aggfunc=np.sum))
It gives:
Full-time 1464 # this is not considering the duplicated IDs
What is the right way to do this? Thanks.
Answer 0 (score: 3)
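It looks like you want to count each (ID, Contract_type) pair only once, so I don't think df.groupby(['Contract_type', 'ID']).Unit_Weight.sum() will work: it still adds up the repeated rows within each pair.
You can try: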
df.drop_duplicates(['Contract_type', 'ID']).groupby('Contract_type').Unit_Weight.sum()
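To see why dropping duplicates first matters, here is a minimal sketch on a trimmed-down copy of the question's data (only the Other and Full-time rows). The names `small`, `naive`, and `unique_total` are made up for this illustration and do not appear in the original post.

```python
import pandas as pd

# Illustrative subset of the question's data (Other and Full-time rows only)
small = pd.DataFrame({
    "ID": ["380689", "380689", "23591", "23591", "235225"],
    "Contract_type": ["Other", "Other", "Full-time", "Full-time", "Full-time"],
    "Unit_Weight": [335, 335, 724, 724, 16],
})

# Naive sum: every row counts, so a repeated ID is added more than once
naive = small.groupby("Contract_type")["Unit_Weight"].sum()

# Keep one row per (Contract_type, ID) pair, then sum per contract type
unique_total = (small.drop_duplicates(["Contract_type", "ID"])
                     .groupby("Contract_type")["Unit_Weight"]
                     .sum())

print(naive)         # Full-time 1464, Other 670
print(unique_total)  # Full-time 740, Other 335
```

The naive total for Full-time matches the 1464 from the question's second attempt, while the deduplicated total gives the 740 the asker wanted.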
Answer 1 (score: 2)
I think you need:
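# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0) is the replacement.
# sort=False keeps the contract types in their order of first appearance.
df1 = (df.drop_duplicates(['Contract_type', 'ID'])
         .set_index('Contract_type')['Unit_Weight']
         .groupby(level=0, sort=False)
         .sum()
         .reset_index())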
print(df1)
  Contract_type  Unit_Weight
0         Other          335
1        Type-I          119
2       Type-II          642
3      Type-III          165
4     Part-time          330
5     Full-time          740
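One caveat: drop_duplicates keeps the first row of each (Contract_type, ID) pair, so these totals assume that an ID always has the same Unit_Weight within a contract type. The sketch below is one possible way to check that assumption, followed by an equivalent groupby-only formulation. The sample frame `df_demo` and the names `weights_per_pair`, `conflicts`, and `totals` are hypothetical and used only for illustration.

```python
import pandas as pd

# Small illustrative sample (a subset of the question's rows)
df_demo = pd.DataFrame({
    "ID": ["23591", "23591", "235225", "4589"],
    "Contract_type": ["Full-time", "Full-time", "Full-time", "Part-time"],
    "Unit_Weight": [724, 724, 16, 165],
})

# Number of distinct weights recorded for each (Contract_type, ID) pair
weights_per_pair = df_demo.groupby(["Contract_type", "ID"])["Unit_Weight"].nunique()

# Pairs with more than one distinct weight would make "keep the first row" ambiguous
conflicts = weights_per_pair[weights_per_pair > 1]
print(conflicts.empty)  # True for this sample

# Same result as drop_duplicates + groupby: first weight per pair, then sum per type
totals = (df_demo.groupby(["Contract_type", "ID"])["Unit_Weight"].first()
                 .groupby(level="Contract_type").sum())
print(totals)  # Full-time 740, Part-time 165
```

If `conflicts` is not empty, you would need to decide which weight to keep for those IDs before summing.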