I loaded data from Excel into a pandas DataFrame. I now want to select only the rows whose ASSESSMENT ID is the maximum ASSESSMENT ID for each APPID, keeping all UI SEQ NUMBER values for that APPID.
APPID  APPNAME  ASSESSMENT ID  UI SEQ NUMBER  QUESTION  ANSWER TEXT .
1      appname  2493           11             Question  No .
1      appname  13808          11             Question  Ctry of domicile .
1      appname  13808          11             Question  Name .
1      appname  35316          11             Question  Ctry of domicile .
1      appname  35316          11             Question  Name .
1      appname  35316          11             Question  Nationality .
1      appname  2493           12             Question  Corp name .
1      appname  2493           12             Question  Cr Br Scr .
1      appname  2493           12             Question  Inc And Assests .
1      appname  2493           12             Question  Int, Ext Reg Reports .
1      appname  13808          12             Question  Corp name .
1      appname  35316          12             Question  Corp name .
1      appname  2493           13             Question  No .
1      appname  13808          13             Question  No .
1      appname  35316          13             Question  No .
1      appname  2493           14             Question  No .
1      appname  13808          14             Question  firms Pos .
1      appname  35316          14             Question  firms Pos .
The result would be:
APPID  APPNAME  ASSESSMENT ID  UI SEQ NUMBER  QUESTION  ANSWER TEXT .
1      appname  35316          11             Question  Ctry of domicile .
1      appname  35316          11             Question  Name .
1      appname  35316          11             Question  Nationality .
1      appname  35316          12             Question  Corp name .
1      appname  35316          13             Question  No .
1      appname  35316          14             Question  firms Pos .
Answer 0 (score: 1)
I think you need boolean indexing with a mask created by apply:
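# group_keys=False keeps the original row index on the mask so it aligns with df
# (without it, pandas 2.0 and later prepend the group keys to the index and the indexing fails)
df1 = df[df.groupby(['APPID', 'UI SEQ NUMBER'], group_keys=False)['ASSESSMENT ID'].apply(lambda x: x == x.max())]
print(df1)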
    APPID  APPNAME  ASSESSMENT ID  UI SEQ NUMBER  QUESTION       ANSWER TEXT.
 3      1  appname          35316             11  Question  Ctry of domicile.
 4      1  appname          35316             11  Question              Name.
 5      1  appname          35316             11  Question       Nationality.
11      1  appname          35316             12  Question         Corp name.
14      1  appname          35316             13  Question                No.
17      1  appname          35316             14  Question         firms Pos.
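For reference, the same mask can also be built with transform('max') instead of apply. This is a swapped-in alternative, not the code from the answer above. It is a minimal sketch, assuming the sample rows from the question with the trailing periods dropped from the text values:

```python
import pandas as pd

# Sample rows from the question: (ASSESSMENT ID, UI SEQ NUMBER, ANSWER TEXT).
# Assumption: trailing periods in the posted data are dropped here.
rows = [
    (2493, 11, "No"),
    (13808, 11, "Ctry of domicile"),
    (13808, 11, "Name"),
    (35316, 11, "Ctry of domicile"),
    (35316, 11, "Name"),
    (35316, 11, "Nationality"),
    (2493, 12, "Corp name"),
    (2493, 12, "Cr Br Scr"),
    (2493, 12, "Inc And Assests"),
    (2493, 12, "Int, Ext Reg Reports"),
    (13808, 12, "Corp name"),
    (35316, 12, "Corp name"),
    (2493, 13, "No"),
    (13808, 13, "No"),
    (35316, 13, "No"),
    (2493, 14, "No"),
    (13808, 14, "firms Pos"),
    (35316, 14, "firms Pos"),
]
df = pd.DataFrame(rows, columns=["ASSESSMENT ID", "UI SEQ NUMBER", "ANSWER TEXT"])
df.insert(0, "APPID", 1)
df.insert(1, "APPNAME", "appname")
df.insert(4, "QUESTION", "Question")

# transform('max') broadcasts each group's maximum back onto the original
# row index, so the comparison yields a mask that lines up with df directly.
group_max = df.groupby(["APPID", "UI SEQ NUMBER"])["ASSESSMENT ID"].transform("max")
df1 = df[df["ASSESSMENT ID"] == group_max]
print(df1)
```

Because the mask shares df's index, no group_keys handling is needed, and every row tied at the group maximum is kept.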
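Or, if you do not need all the duplicated values (every row that shares the maximum), use idxmax, which keeps one row per group: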
df1 = df.loc[df.groupby(['APPID', 'UI SEQ NUMBER'])['ASSESSMENT ID'].idxmax()]
print(df1)
    APPID  APPNAME  ASSESSMENT ID  UI SEQ NUMBER  QUESTION       ANSWER TEXT.
 3      1  appname          35316             11  Question  Ctry of domicile.
11      1  appname          35316             12  Question         Corp name.
14      1  appname          35316             13  Question                No.
17      1  appname          35316             14  Question         firms Pos.
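To see why idxmax drops the tied rows, here is a small sketch on a hypothetical four-row frame (the values are made up for illustration; only the column names come from the question). idxmax returns the index label of the first occurrence of the maximum in each group, so a group with several rows at the maximum contributes only one of them:

```python
import pandas as pd

# Hypothetical mini-frame: UI SEQ NUMBER 11 has two rows tied at the maximum.
df = pd.DataFrame({
    "APPID": [1, 1, 1, 1],
    "UI SEQ NUMBER": [11, 11, 11, 12],
    "ASSESSMENT ID": [2493, 35316, 35316, 35316],
})

# One index label per group: the first row that holds the group's maximum.
first_max = df.groupby(["APPID", "UI SEQ NUMBER"])["ASSESSMENT ID"].idxmax()
df1 = df.loc[first_max]
print(df1)  # rows with index labels 1 and 3; row 2 (the tie) is dropped
```

If the tied rows carry different answers, as in the question's data, the mask-based approach is probably the safer choice.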