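I have a DataFrame with rows that are effectively duplicated across its columns (SFDC_ID, left_side, right_SFDC_ID, right_side, and similarity).
Right now, SFDC_ID and right_SFDC_ID are duplicated in the following way: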
SFDC_ID left_side right_SFDC_ID right_side similairity
0013s00000vEVuwAAG Hague Quality Water 0013s00000vEW72AAG Hague Quality Waters 0.99023304
0013s00000vEW72AAG Hague Quality Waters 0013s00000vEVuwAAG Hague Quality Water 0.99023304
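Looking closely, the SFDC_ID in row 1 is the same as the right_SFDC_ID in row 2, so the second row is just the first row with its left and right sides swapped.
How can I drop the second row using pandas?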
Answer 0 (score: 2)
Here is one way:
# Boolean Series: True where SFDC_ID sorts before right_SFDC_ID (plain string comparison)
mask = df['SFDC_ID'] < df['right_SFDC_ID']
# col1 holds the smaller of the two IDs:
# SFDC_ID where mask is True, otherwise right_SFDC_ID
df['col1'] = df['SFDC_ID'].where(mask, df['right_SFDC_ID'])
# col2 holds the larger of the two IDs:
# right_SFDC_ID where mask is True, otherwise SFDC_ID
df['col2'] = df['right_SFDC_ID'].where(mask, df['SFDC_ID'])
# Mirrored rows now share the same (col1, col2) pair;
# drop_duplicates keeps the first row of each pair and removes the later ones
df = df.drop_duplicates(subset=['col1', 'col2'])
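To check this approach against the two sample rows from the question, a minimal self-contained sketch might look like the following. The DataFrame is rebuilt by hand from the question's data, and the helper name `drop_mirrored_pairs` is hypothetical (it is not part of pandas). Unlike the snippet above, it builds the sorted key in a temporary frame, so no helper columns are left behind in the result.

```python
import pandas as pd

# Sample rows rebuilt by hand from the question
# (the last column name is spelled as in the question's header).
df = pd.DataFrame({
    "SFDC_ID": ["0013s00000vEVuwAAG", "0013s00000vEW72AAG"],
    "left_side": ["Hague Quality Water", "Hague Quality Waters"],
    "right_SFDC_ID": ["0013s00000vEW72AAG", "0013s00000vEVuwAAG"],
    "right_side": ["Hague Quality Waters", "Hague Quality Water"],
    "similairity": [0.99023304, 0.99023304],
})


def drop_mirrored_pairs(frame, left="SFDC_ID", right="right_SFDC_ID"):
    """Hypothetical helper: keep the first row of each unordered (left, right) ID pair."""
    mask = frame[left] < frame[right]
    # Order-independent key: smaller ID first, larger ID second.
    lo = frame[left].where(mask, frame[right])
    hi = frame[right].where(mask, frame[left])
    # duplicated() flags every repeat after the first occurrence of a key.
    is_repeat = pd.DataFrame({"lo": lo, "hi": hi}).duplicated()
    return frame[~is_repeat]


result = drop_mirrored_pairs(df)
print(result)
```

Only the first of the two mirrored rows should remain, with the original five columns unchanged.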
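Answer 1 (score: 0)
You can iterate over the rows and drop every row whose right_SFDC_ID matches the SFDC_ID of the row directly before it. Note that this only catches mirrored rows that are adjacent: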
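# Collect the labels first: dropping rows inside the loop would shift the positions iloc relies on
to_drop = []
for pos in range(1, len(df)):
    prev_SFDC_ID = df.iloc[pos - 1]['SFDC_ID']  # SFDC_ID of the previous row
    if df.iloc[pos]['right_SFDC_ID'] == prev_SFDC_ID:
        to_drop.append(df.index[pos])
df = df.drop(index=to_drop)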
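The same previous-row comparison can probably be written without an explicit loop. The following is a minimal sketch, assuming each mirrored row sits directly after its counterpart. The IDs "A" through "D" are made-up placeholders, not values from the question, and the names `matches_prev` and `deduped` are illustrative only.

```python
import pandas as pd

# Placeholder data (hypothetical IDs): row 1 mirrors row 0, row 2 is unrelated.
df = pd.DataFrame({
    "SFDC_ID": ["A", "B", "C"],
    "right_SFDC_ID": ["B", "A", "D"],
})

# shift(1) lines each row up with the SFDC_ID of the row directly above it;
# the first row is compared against NaN, which never matches.
matches_prev = df["right_SFDC_ID"] == df["SFDC_ID"].shift(1)

# Keep only the rows that do not point back at the previous row.
deduped = df[~matches_prev]
print(deduped)
```

Like the loop, this only compares neighbouring rows, so mirrored rows that are separated by other rows would not be detected. It also looks at one direction only (right_SFDC_ID against the previous SFDC_ID), so sorting the ID pair as in the first answer is likely the safer choice for unordered data.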