I am rewriting some old code using pandas. My dataframe looks like this:
index stop_id stop_name stop_lat stop_lon stop_id2
0 A12 Some St 40.889248 -73.898583 None
1 A14 Some St 40.889758 -73.908573 None
2 B09 Some St 40.788924 -74.846576 None
3 A22 Some St 40.889248 -73.898583 None
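Note that stop_lat and stop_lon are duplicated for stop_ids 'A12' and 'A22'.
I want to remove the duplicate stop (stop_id = 'A22') and record the stop_id of the removed row in stop_id2 of the row that is kept, so that the dataframe looks like this: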
index stop_id stop_name stop_lat stop_lon stop_id2
0 A12 Some St 40.889248 -73.898583 A22
1 A14 Some St 40.889758 -73.908573 None
2 B09 Some St 40.788924 -74.846576 None
Previously I did this with my data stored in a dictionary:
d = {'A12': ['Some St', 40.889248, -73.898583, None], 'A14': ['Some St', 40.889758, -73.908573, None], 'B09': ['Some St', 40.788924, -74.846576, None], 'A22': ['Some St', 40.889248, -73.898583, None]}
# Compare the (lat, lon) pairs directly; comparing lat + lon sums could match unrelated stops.
if d['A12'][1:3] == d['A22'][1:3]:
    del d['A22']
    d['A12'][-1] = 'A22'
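I would like to do the same thing in pandas. I know that if I just use: df = df.drop_duplicates(['stop_lat', 'stop_lon'])
I will lose the duplicate record without keeping its ID. I need to keep the ID of the removed stop in order to get the correct metadata.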
Answer 0 (score: 1)
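# Keep the first stop at each coordinate pair; drop the empty stop_id2 column so the merge can rebuild it.
new_df = df[~df.duplicated(subset=['stop_lat', 'stop_lon'], keep='first')].drop(columns='stop_id2')
# Collect the removed duplicates and expose their stop_id as stop_id2.
duplicates_df = df[df.duplicated(subset=['stop_lat', 'stop_lon'], keep='first')][['stop_lat', 'stop_lon', 'stop_id']].rename(columns={'stop_id': 'stop_id2'})
new_df = new_df.merge(duplicates_df, how='left', on=['stop_lat', 'stop_lon'])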
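For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same duplicated-plus-merge idea. It rebuilds the sample frame from the question; the names coords, is_dup, kept, removed and result are only illustrative, and it assumes each coordinate pair has at most one duplicate (with more than one, the left merge would produce one row per removed stop).

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "stop_id": ["A12", "A14", "B09", "A22"],
    "stop_name": ["Some St"] * 4,
    "stop_lat": [40.889248, 40.889758, 40.788924, 40.889248],
    "stop_lon": [-73.898583, -73.908573, -74.846576, -73.898583],
    "stop_id2": [None] * 4,
})

coords = ["stop_lat", "stop_lon"]

# True for every row whose coordinates already appeared earlier (here: A22).
is_dup = df.duplicated(subset=coords, keep="first")

# Rows to keep, without the empty stop_id2 column (the merge rebuilds it).
kept = df[~is_dup].drop(columns="stop_id2")

# Removed rows, reduced to coordinates and id, with the id renamed to stop_id2.
removed = df.loc[is_dup, coords + ["stop_id"]].rename(columns={"stop_id": "stop_id2"})

# Left merge: every kept stop survives; only matching coordinates get a stop_id2.
result = kept.merge(removed, how="left", on=coords)
print(result)
```

With the sample data, the A12 row ends up with stop_id2 set to 'A22', and the rows without a duplicate get NaN in stop_id2 rather than None.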
Answer 1 (score: 1)