pandas - Drop rows with duplicate key columns but keep the row with the fewest NaNs

Time: 2020-05-20 22:09:52

Tags: python pandas duplicates

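The code below removes rows that are duplicated on the key columns. However, where several rows share the same key columns (ID and date), I want to keep the row with the fewest NaN cells.

My code feels sloppy to me. Is there any way to make it more concise/pythonic?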

# Import pandas; `dataframe` is the DataFrame being cleaned, `keys` are the key columns
import pandas as pd
df = dataframe
keys = ['ID', 'date']
# Select every row whose key combination occurs more than once
# (.copy() avoids SettingWithCopyWarning when columns are added below)
dup = df[df.duplicated(subset=keys, keep=False)].copy()
# Drop all of those duplicated rows from the original dataframe
df.drop_duplicates(subset=keys, inplace=True, keep=False)
# Count the NaN cells in each row
dup['nancells'] = dup.isnull().sum(axis=1)
# Sort so that, within each key group, the row with the fewest NaNs comes first
dup.sort_values(keys + ['nancells'], inplace=True)
# Rank rows within each key group; rank 0 is the row with the fewest NaNs
dup['rnk'] = dup.groupby(keys)['nancells'].cumcount()
# Keep only the row with the fewest NaNs from each group
dup = dup[dup.rnk == 0].copy()
# Drop the helper columns
dup.drop(['nancells', 'rnk'], axis=1, inplace=True)
# Flag whether these keys had duplicates removed
dup['duplicate_corrected'] = 1
df['duplicate_corrected'] = 0
# Concatenate the corrected duplicate rows with the non-duplicate rows
df = pd.concat([df, dup], axis=0, ignore_index=True)
del dup
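
For comparison, the same idea can probably be written more compactly by counting NaNs once, sorting on that count, and letting `drop_duplicates` keep the first row per key. The following is only a sketch under a few assumptions: the function name `keep_fewest_nans`, the helper column `_nan_count`, and the sample data are made up for illustration, and unlike the code above it returns the rows in their original order instead of appending the corrected duplicates at the end.

```python
import numpy as np
import pandas as pd


def keep_fewest_nans(df, keys):
    """Return one row per key combination, preferring the row with the fewest NaNs."""
    out = df.copy()
    # Count missing cells per row before any helper columns are added.
    out["_nan_count"] = out.isna().sum(axis=1)
    # Flag every row whose key combination occurs more than once.
    out["duplicate_corrected"] = out.duplicated(subset=keys, keep=False).astype(int)
    # A stable sort keeps the original order among rows with equal NaN counts.
    out = out.sort_values("_nan_count", kind="mergesort")
    # After sorting, the first row of each key group has the fewest NaNs.
    out = out.drop_duplicates(subset=keys, keep="first")
    return out.drop(columns="_nan_count").sort_index().reset_index(drop=True)


# Hypothetical sample data: keys (1, 2020-01-01) and (3, 2020-01-03) are duplicated.
example = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3],
    "date": ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-03"],
    "a": [np.nan, 1.0, 2.0, 3.0, np.nan],
    "b": [np.nan, 5.0, np.nan, 6.0, 7.0],
})
result = keep_fewest_nans(example, ["ID", "date"])
print(result)
```

Because the input is copied first, the original dataframe is left untouched, which avoids the `inplace=True` calls and the intermediate `dup` frame. When two duplicate rows have the same number of NaNs, this sketch keeps whichever appeared first in the input.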

0 answers:

No answers