Python: drop duplicates in a specific order (not `first` or `last`)

Time: 2018-06-05 20:54:00

Tags: python pandas

ID  values
111 reason1
111 reason2
111 reason3
222 reason2
222 reason4
222 reason5

df.drop_duplicates(["ID"], keep='???', inplace=True)

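The approach I know is drop_duplicates, but its keep option only offers first and last. I want to check whether a reason2 record exists and, if so, keep it; otherwise check for reason3, and so on. Basically, there is a specific order of preference, such as reason2, reason3, reason4, etc.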
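The limitation described above can be reproduced with a minimal sketch. It assumes the sample table is loaded as a DataFrame with integer IDs and string reasons (the question does not state the dtypes), and the names `first` and `last` are only for illustration.

```python
import pandas as pd

# Sample data from the question (dtypes are an assumption).
df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 222],
    "values": ["reason1", "reason2", "reason3",
               "reason2", "reason4", "reason5"],
})

# keep='first' and keep='last' depend only on row position,
# not on which reason is preferred.
first = df.drop_duplicates(["ID"], keep="first")
last = df.drop_duplicates(["ID"], keep="last")

print(first)  # keeps reason1 for ID 111
print(last)   # keeps reason3 for ID 111
```

Neither call keeps reason2 for ID 111, which is why the rows need to be ordered by preference before dropping duplicates.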

2 answers:

Answer 0 (score: 4)

Based on the comments, this could be one implementation (implementing @brittenb's idea):

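# Lower number = higher priority. The numbers are illustrative: as written,
# reason1 wins for ID 111. To prefer reason2, give it the lowest number.
priority_dict = {
    'reason1': 1,
    'reason2': 2,
    'reason3': 3,
    'reason4': 4,
    'reason5': 5
}
# Map each reason to its priority, then sort so the preferred row comes first per ID.
df['priority'] = df['values'].map(priority_dict)
df = df.sort_values(by=['ID', 'priority'])
# Keep the first (highest-priority) row for each ID.
df.drop_duplicates(['ID'], keep='first')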

Output:

    ID   values  priority
0  111  reason1         1
3  222  reason2         2
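As a hedged variation on the same idea, the sketch below builds the ranking from a list in the order the question asks for, and drops the helper column afterwards. The position of reason1 is an assumption (the question does not say where it belongs, so it is placed last here), and the names `preferred`, `rank` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [111, 111, 111, 222, 222, 222],
    "values": ["reason1", "reason2", "reason3",
               "reason2", "reason4", "reason5"],
})

# Preference order from the question; putting reason1 last is an assumption.
preferred = ["reason2", "reason3", "reason4", "reason5", "reason1"]
rank = {reason: i for i, reason in enumerate(preferred)}

result = (
    df.assign(priority=df["values"].map(rank))  # unmapped reasons become NaN
      .sort_values(["ID", "priority"])          # NaN sorts last by default
      .drop_duplicates("ID", keep="first")      # best-ranked row per ID
      .drop(columns="priority")                 # remove the helper column
)
print(result)
```

With this ordering both IDs keep their reason2 row. A reason missing from `preferred` maps to NaN and is therefore only kept when an ID has no ranked reason at all.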

Answer 1 (score: 0)

Use a 'category' dtype with a defined order, then sort:

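# astype('category', ordered=True) is not accepted by current pandas versions;
# pass ordered=True to reorder_categories instead.
df['values'] = df['values'].astype('category')\
                           .cat.reorder_categories(['reason2',
                                                    'reason3',
                                                    'reason1',
                                                    'reason4',
                                                    'reason5'],
                                                   ordered=True)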

df.sort_values('values').drop_duplicates('ID', keep='first')

Output:

    ID   values
1  111  reason2
3  222  reason2
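The same ordering can also be declared up front with `pd.CategoricalDtype`, which does not require every category to be present in the data. This is a sketch rather than the answerer's code: the rows for ID 333 are hypothetical, added only to show the fallback when an ID has no reason2, and `order`, `dtype`, `out` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    # ID 333 is hypothetical: it has no reason2, so reason3 should win.
    "ID": [111, 111, 111, 222, 222, 222, 333, 333],
    "values": ["reason1", "reason2", "reason3",
               "reason2", "reason4", "reason5",
               "reason4", "reason3"],
})

# Category order taken from the answer above.
order = ["reason2", "reason3", "reason1", "reason4", "reason5"]
dtype = pd.CategoricalDtype(categories=order, ordered=True)

out = df.copy()
out["values"] = out["values"].astype(dtype)

# Sort by ID, then by category order, and keep the first row per ID.
result = out.sort_values(["ID", "values"]).drop_duplicates("ID", keep="first")
print(result)
```

Values that are not listed in `order` become NaN after the cast, so it is worth checking for missing values if the set of reasons is not fixed.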