Pandas: Remove duplicate records while keeping their old values in the DataFrame for reference

Time: 2016-09-29 16:17:07

Tags: python pandas

I am rewriting a piece of old code using pandas. My DataFrame looks like this:

index stop_id   stop_name   stop_lat     stop_lon  stop_id2
0         A12     Some St  40.889248   -73.898583      None
1         A14     Some St  40.889758   -73.908573      None
2         B09     Some St  40.788924   -74.846576      None
3         A22     Some St  40.889248   -73.898583      None

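Note that stop_lat and stop_lon are duplicated for stop_ids 'A12' and 'A22'.

I want to remove the duplicate stop (stop_id = 'A22') while recording the removed record's stop_id in stop_id2. The DataFrame would then look like this: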

index stop_id   stop_name   stop_lat     stop_lon  stop_id2
0         A12     Some St  40.889248   -73.898583      A22
1         A14     Some St  40.889758   -73.908573      None
2         B09     Some St  40.788924   -74.846576      None

I have done this before, when my data was stored in a dictionary:

d = {'A12': ['Some St', 40.889248, -73.898583, None], 'A14': ['Some St', 40.889758, -73.908573, None], 'B09': ['Some St', 40.788924, -74.846576, None], 'A22': ['Some St', 40.889248, -73.898583, None]}

if d['A12'][1]+d['A12'][2]==d['A22'][1]+d['A22'][2]:
   del d['A22']
   d['A12'][-1]='A22'

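I would like to do a similar task in pandas. I know that if I only use:

df = df.drop_duplicates(['stop_lat', 'stop_lon'])

I will lose the duplicate record, and its id will not be kept. I need to keep the ID of the removed stop so that I can look up the correct metadata.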
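To make the problem concrete, here is a minimal sketch that rebuilds the sample DataFrame from the tables above and applies drop_duplicates on its own. The variable name `deduped` is only illustrative and does not appear in the original post.

```python
import pandas as pd

# Sample data copied from the question's first table.
df = pd.DataFrame({
    "stop_id": ["A12", "A14", "B09", "A22"],
    "stop_name": ["Some St"] * 4,
    "stop_lat": [40.889248, 40.889758, 40.788924, 40.889248],
    "stop_lon": [-73.898583, -73.908573, -74.846576, -73.898583],
    "stop_id2": [None] * 4,
})

# Dropping duplicates by coordinates keeps the first row of each pair (A12)
# and silently discards the later one (A22).
deduped = df.drop_duplicates(subset=["stop_lat", "stop_lon"])
print(deduped)
```

The A22 row is gone and stop_id2 is still empty on every remaining row, so nothing in the result says which stop was removed.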

2 Answers:

Answer 0: (score: 1)

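# Keep the first occurrence of each (stop_lat, stop_lon) pair
new_df = df[~df.duplicated(subset=['stop_lat', 'stop_lon'], keep='first')]

# The later occurrences, i.e. the rows being dropped, along with their stop_id
duplicates_df = df[df.duplicated(subset=['stop_lat', 'stop_lon'], keep='first')][['stop_lat', 'stop_lon', 'stop_id']]

# Left-join the dropped ids back on; they end up in the stop_id_y column
new_df.merge(duplicates_df, how='left', on=['stop_lat', 'stop_lon'])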
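As a self-contained sketch of the merge idea in this answer, the snippet below rebuilds the question's sample data and writes the dropped id directly into stop_id2. The names `is_dup`, `kept`, `dropped` and `result` are illustrative. Renaming stop_id to stop_id2 before the merge is an assumption about the desired output layout, not something the answer itself specifies.

```python
import pandas as pd

# Sample data copied from the question's first table.
df = pd.DataFrame({
    "stop_id": ["A12", "A14", "B09", "A22"],
    "stop_name": ["Some St"] * 4,
    "stop_lat": [40.889248, 40.889758, 40.788924, 40.889248],
    "stop_lon": [-73.898583, -73.908573, -74.846576, -73.898583],
    "stop_id2": [None] * 4,
})

cols = ["stop_lat", "stop_lon"]

# True for every row whose coordinates were already seen earlier.
is_dup = df.duplicated(subset=cols, keep="first")

# Rows to keep; the empty stop_id2 column is dropped so the merge can refill it.
kept = df[~is_dup].drop(columns="stop_id2")

# Rows being removed, reduced to coordinates plus their id (renamed to stop_id2).
dropped = df.loc[is_dup, cols + ["stop_id"]].rename(columns={"stop_id": "stop_id2"})

# If a location occurs more than twice, keep one dropped id per location
# so the left merge does not multiply rows.
dropped = dropped.drop_duplicates(subset=cols)

result = kept.merge(dropped, how="left", on=cols)
print(result)
```

Rows without a dropped counterpart get a missing value in stop_id2, which plays the same role as None in the question's expected table.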

Answer 1: (score: 1)

Get a mask of the duplicates

cols = ['stop_lat', 'stop_lon']
dups = df.duplicated(subset=cols)

Subset df with the mask

nodups = df[~dups].set_index(cols)

The duplicates may themselves contain duplicates, so keep only the first of each

first_dup = df[dups].drop_duplicates(subset=cols)
first_dup = first_dup.set_index(cols).stop_id

Assign accordingly

nodups.loc[first_dup.index, 'stop_id2'] = first_dup
nodups
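The steps above can be put together into one runnable sketch. To show why the duplicates are de-duplicated a second time, the sample data here includes a hypothetical third stop, A30, at the same coordinates as A12 and A22. That row is an assumption added for illustration and is not in the question. The final reset_index step and the `result` name are also additions.

```python
import pandas as pd

# Sample data from the question, plus a hypothetical stop A30
# that shares its coordinates with A12 and A22.
df = pd.DataFrame({
    "stop_id": ["A12", "A14", "B09", "A22", "A30"],
    "stop_name": ["Some St"] * 5,
    "stop_lat": [40.889248, 40.889758, 40.788924, 40.889248, 40.889248],
    "stop_lon": [-73.898583, -73.908573, -74.846576, -73.898583, -73.898583],
    "stop_id2": [None] * 5,
})

cols = ["stop_lat", "stop_lon"]
dups = df.duplicated(subset=cols)       # True for A22 and A30

nodups = df[~dups].set_index(cols)      # A12, A14, B09

# A22 and A30 share coordinates, so keep only the first of them (A22).
first_dup = df[dups].drop_duplicates(subset=cols)
first_dup = first_dup.set_index(cols).stop_id

# Write each dropped id onto the kept row with the same coordinates.
nodups.loc[first_dup.index, "stop_id2"] = first_dup

# Move the coordinates back into columns, in the original column order.
result = nodups.reset_index()[df.columns.tolist()]
print(result)
```

With a single stop_id2 column only one dropped id per location can be recorded, so A30 is removed without a trace in this sketch. Keeping every dropped id would need a different layout, such as a list per row.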
