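I am trying to compare the rows within each group and, depending on which condition they meet, keep the whole group, keep only the most recent row, or keep only the first row.
Each group in the DataFrame has only 2 rows. If the first row of the group has a LastFour of '2290' or one that starts with the letter 'M', and if in the second row the LastFour column equals 0087 or 0117 and NUM != 6708, then I want to keep both rows. That is the first condition. The second condition: if the rows are identical in every column except the Date column, keep the row with the most recent date. Otherwise, if neither condition is met, keep only the first row and drop the second.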
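The three rules above can be written as plain if statements in a small function that is called once per 2-row group. The sketch below is only an illustration under stated assumptions: the sample rows are made up, `keep_flags` and `flag_rows` are hypothetical helper names, LastFour and NUM are assumed to be stored as strings, the NUM check is assumed to apply to the second row, and "the Date column" is assumed to mean Date2.

```python
import pandas as pd

# Values taken from the rule description; the constant names are hypothetical.
FIRST_ROW_CODES = {"2290"}
FIRST_ROW_PREFIX = "M"
SECOND_ROW_CODES = {"0087", "0117"}
EXCLUDED_NUM = "6708"


def keep_flags(group: pd.DataFrame, date_col: str = "Date2") -> pd.Series:
    """Return a boolean Series saying which rows of a 2-row group to keep."""
    first, second = group.iloc[0], group.iloc[1]

    # Rule 1: keep both rows.
    first_ok = (first["LastFour"] in FIRST_ROW_CODES
                or first["LastFour"].startswith(FIRST_ROW_PREFIX))
    second_ok = (second["LastFour"] in SECOND_ROW_CODES
                 and second["NUM"] != EXCLUDED_NUM)
    if first_ok and second_ok:
        return pd.Series([True, True], index=group.index)

    # Rule 2: rows identical apart from the date column -> keep the latest date.
    others = [c for c in group.columns if c != date_col]
    if len(group[others].drop_duplicates()) == 1:
        dates = pd.to_datetime(group[date_col], format="%m/%d/%Y")
        return pd.Series(group.index == dates.idxmax(), index=group.index)

    # Rule 3: otherwise keep only the first row.
    return pd.Series([True, False], index=group.index)


def flag_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Add a Keep column; assumes every KEY has exactly two rows."""
    flags = [keep_flags(g) for _, g in df.groupby("KEY", sort=False)]
    out = df.copy()
    out["Keep"] = pd.concat(flags)  # aligned on the original index
    return out


# Made-up sample data, one group per rule plus one that fails the NUM check.
df = pd.DataFrame({
    "KEY": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "CLAIM": ["1", "2", "3", "3", "5", "6", "7", "8"],
    "LastFour": ["2290", "0087", "0025", "0025", "0007", "0007", "M123", "0117"],
    "NUM": ["30087", "30087", "70025", "70025", "140007", "140007", "9", "6708"],
    "Date1": ["3/2/2012"] * 8,
    "Date2": ["3/5/2012", "3/5/2012", "1/6/2012", "1/9/2012",
              "1/4/2012", "1/4/2012", "2/1/2012", "2/1/2012"],
    "Code": [1, 1, 3, 3, 2, 2, 1, 1],
})

result = flag_rows(df)
print(result[["KEY", "LastFour", "NUM", "Date2", "Keep"]])
```

Filtering with `result[result["Keep"]]` then leaves only the rows to keep. Because the rule wording is ambiguous, this literal reading may need adjusting before it matches a real data set.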
Original df:
Expected result:
      KEY  CLAIM  LastFour     NUM       Date1      Date2  Code
166  0944    163      0087   30087    3/2/2012   3/5/2012     1
167  0944    164      0087   30087    3/3/2012   3/5/2012     1
225  1413    222      2290  123422   2/10/2012  2/11/2012     1
226  1413    223      0032  123123   2/10/2012  2/11/2012     1
315  1979    312      0025   70025  12/24/2011   1/6/2012     3
316  1979    313      0025   70025  12/24/2011   1/6/2012     3
320  1997    317      0007  140007    1/1/2012   1/4/2012     2
321  1997    318      0007  140007    1/1/2012   1/4/2012     2
My approach was to use if statements, but I ran into trouble.
      KEY  CLAIM  LastFour     NUM       Date1      Date2  Code   Keep
166  0944    163      0087   30087    3/2/2012   3/5/2012     1  FALSE
167  0944    164      0087   30087    3/3/2012   3/5/2012     1   TRUE
225  1413    222      2290  123422   2/10/2012  2/11/2012     1   TRUE
226  1413    223      0032  123123   2/10/2012  2/11/2012     1   TRUE
315  1979    312      0025   70025  12/24/2011   1/6/2012     3  FALSE
316  1979    313      0025   70025  12/24/2011   1/6/2012     3   TRUE
320  1997    317      0007  140007    1/1/2012   1/4/2012     2  FALSE
321  1997    318      0007  140007    1/1/2012   1/4/2012     2   TRUE
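I appreciate any help.
Answer 0 (score: 0)
I am not a fan of this approach, but it gets the job done...
I ended up creating three separate rules. I ran the first two rules against the DataFrame, but the third rule depends on the first two, so I had to run it last. Before running the third rule, I created a smaller DataFrame from the result of the previous rules by identifying the rows that were still unchanged after rules 1 and 2 had run. It is quite convoluted, but it gets the job done.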
# LII_range, Begin_with and s_term are lists of LastFour values/prefixes defined elsewhere (not shown).
# Rule 1: flag both rows of a pair when the first row matches LII_range or Begin_with
# and the second row matches s_term or has NUM '670899'.
PartA = ((df['KEY'] == df['KEY'].shift(-1)) & ((df['LastFour'].isin(LII_range)) | (df['LastFour'].str.get(0).isin(Begin_with))) & ((df['LastFour'].shift(-1).isin(s_term)) | (df['NUM'].shift(-1).isin(['670899'])))) \
    | (((df['LastFour'].shift(1).isin(LII_range)) | (df['LastFour'].shift(1).str.get(0).isin(Begin_with))) & (df['KEY'] == df['KEY'].shift(1)) & ((df['LastFour'].isin(s_term)) | (df['NUM'].isin(['670899']))))
# Rule 2: same Date1, Code and CLM_CD but a different Date2: keep the row with the larger CLAIM.
# (CLM_CD does not appear among the sample columns shown above.)
PartB = ((df['KEY'] == df['KEY'].shift(-1)) & (df['Date1'] == df['Date1'].shift(-1)) & (df['Code'] == df['Code'].shift(-1)) & (df['CLM_CD'] == df['CLM_CD'].shift(-1)) & (df['Date2'] != df['Date2'].shift(-1)) & (df['CLAIM'] > df['CLAIM'].shift(-1))) |\
    ((df['KEY'] == df['KEY'].shift(1)) & (df['Date1'] == df['Date1'].shift(1)) & (df['Code'] == df['Code'].shift(1)) & (df['CLM_CD'] == df['CLM_CD'].shift(1)) & (df['Date2'] != df['Date2'].shift(1)) & (df['CLAIM'] > df['CLAIM'].shift(1)))
df['Keep'] = PartA | PartB  # run through the first two rules
# Collect the KEYs whose rows are all Keep == False after rules 1 and 2; PartC applies only to these groups.
C_F_Ids = df.loc[~df['Keep'], 'KEY'].unique()
C_T_Ids = df.loc[df['Keep'], 'KEY'].unique()
C_Ids = [item for item in C_F_Ids if item not in C_T_Ids]
# Rule 3: built only now because it needs C_Ids. In the remaining groups, keep the row with the larger CLAIM.
PartC = ((df['KEY'].isin(C_Ids)) & ((df['KEY'] == df['KEY'].shift(-1)) & (df['CLAIM'] > df['CLAIM'].shift(-1)))) |\
    ((df['KEY'].isin(C_Ids)) & ((df['KEY'] == df['KEY'].shift(1)) & (df['CLAIM'] > df['CLAIM'].shift(1))))
# Rerun all the rules to clean up all rows
df['Keep'] = PartA | PartB | PartC
Final output:
      KEY  CLAIM  LastFour     NUM       Date1      Date2  Code   Keep
166  0944    163      0087   30087    3/2/2012   3/5/2012     1  FALSE
167  0944    164      0087   30087    3/3/2012   3/5/2012     1   TRUE
225  1413    222      2290  123422   2/10/2012  2/11/2012     1   TRUE
226  1413    223      0032  123123   2/10/2012  2/11/2012     1   TRUE
315  1979    312      0025   70025  12/24/2011   1/6/2012     3  FALSE
316  1979    313      0025   70025  12/24/2011   1/6/2012     3   TRUE
320  1997    317      0007  140007    1/1/2012   1/4/2012     2  FALSE
321  1997    318      0007  140007    1/1/2012   1/4/2012     2   TRUE
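The rules in this answer all rely on the same trick: comparing each row with its neighbour through `shift(-1)` (the next row) and `shift(1)` (the previous row). A minimal sketch of that trick in isolation, on made-up data with hypothetical variable names, might look like this. It assumes the two rows of each KEY sit next to each other, as they do in the data above.

```python
import pandas as pd

# Made-up data: two pairs, rows of each pair adjacent.
df = pd.DataFrame({
    "KEY": ["A", "A", "B", "B"],
    "CLAIM": [10, 11, 21, 20],
})

# True on the first row of a pair (the next row has the same KEY).
same_as_next = df["KEY"] == df["KEY"].shift(-1)
# True on the second row of a pair (the previous row has the same KEY).
same_as_prev = df["KEY"] == df["KEY"].shift(1)

# Keep whichever row of the pair has the larger CLAIM.
# Comparisons against the NaN introduced by shift() evaluate to False.
larger_than_next = same_as_next & (df["CLAIM"] > df["CLAIM"].shift(-1))
larger_than_prev = same_as_prev & (df["CLAIM"] > df["CLAIM"].shift(1))
df["Keep"] = larger_than_next | larger_than_prev

print(df)
```

Each rule needs two branches, one written from the point of view of the first row and one from the second, which is why every Part expression in the answer is an OR of two nearly identical halves.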