Python / Pandas - Drop duplicate rows based on a column value

Time: 2017-10-28 19:13:31

Tags: python pandas dataframe data-analysis data-science

I have a DataFrame like this:

     sale_id          dt        receipts_qty 
31     196.0  2017-02-19                95.0    
32     203.0  2017-02-20               101.0   
33     196.0  2017-02-21               105.0            
34     196.0  2017-02-22               112.0           
35     196.0  2017-02-23               118.0           
36     196.0  2017-02-24               135.0            
37     196.0  2017-02-25               135.0           
38     196.0  2017-02-26               124.0           
40     203.0  2017-02-27               290.0          
39     196.0  2017-02-27                84.0          
42     203.0  2017-02-28               330.0            
41     196.0  2017-02-28               124.0           
43     196.0  2017-03-01               100.0          
44     203.0  2017-03-01               361.0         

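I need to drop duplicates by dt, keeping the row where sale_id == 196 whenever a date appears more than once. I have only found drop_duplicates('dt', keep='last') and drop_duplicates('dt', keep='first'), but neither is what I need.

I would like to get this DataFrame: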

     sale_id          dt        receipts_qty  
31     196.0  2017-02-19                95.0   
32     203.0  2017-02-20               101.0       
33     196.0  2017-02-21               105.0            
34     196.0  2017-02-22               112.0           
35     196.0  2017-02-23               118.0           
36     196.0  2017-02-24               135.0            
37     196.0  2017-02-25               135.0           
38     196.0  2017-02-26               124.0                 
39     196.0  2017-02-27                84.0                     
41     196.0  2017-02-28               124.0           
43     196.0  2017-03-01               100.0          

1 answer:

Answer 0: (score: 0)

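First create a helper column from the condition, then use sort_values so that the matching rows come first and drop_duplicates to keep the first row for each dt.

Final cleanup: remove the helper column a and restore the original order with sort_index.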

print (df)
    sale_id          dt  receipts_qty
31    196.0  2017-02-19          95.0
32    203.0  2017-02-20         101.0
33    196.0  2017-02-21         105.0
34    196.0  2017-02-22         112.0
35    196.0  2017-02-23         118.0
36    196.0  2017-02-24         135.0
37    196.0  2017-02-25         135.0
38    196.0  2017-02-26         124.0
40    203.0  2017-02-27         290.0
39    196.0  2017-02-27          84.0
42    103.0  2017-02-28         330.0 <-changed data, value < 196
41    196.0  2017-02-28         124.0
43    196.0  2017-03-01         100.0
44    203.0  2017-03-01         361.0
#helper column: 1 where sale_id == 196, otherwise 0
df['a'] = (df.sale_id == 196).astype(int)
#sort so the 196 rows come first, remove duplicates, remove helper column
df = (df.sort_values(['a','dt'], ascending=[False, True])
       .drop_duplicates('dt')
       .drop('a', axis=1)
       .sort_index())
print (df)
    sale_id          dt  receipts_qty
31    196.0  2017-02-19          95.0
32    203.0  2017-02-20         101.0
33    196.0  2017-02-21         105.0
34    196.0  2017-02-22         112.0
35    196.0  2017-02-23         118.0
36    196.0  2017-02-24         135.0
37    196.0  2017-02-25         135.0
38    196.0  2017-02-26         124.0
39    196.0  2017-02-27          84.0
41    196.0  2017-02-28         124.0
43    196.0  2017-03-01         100.0
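
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same helper-column approach, using the data from the question. The function name drop_dt_duplicates_preferring, its preferred_sale_id parameter, and the temporary column name _preferred are made up for illustration and are not part of pandas or of the original answer.

```python
import pandas as pd

# Sample data copied from the question, with the original index labels.
df = pd.DataFrame(
    {
        "sale_id": [196.0, 203.0, 196.0, 196.0, 196.0, 196.0, 196.0,
                    196.0, 203.0, 196.0, 203.0, 196.0, 196.0, 203.0],
        "dt": ["2017-02-19", "2017-02-20", "2017-02-21", "2017-02-22",
               "2017-02-23", "2017-02-24", "2017-02-25", "2017-02-26",
               "2017-02-27", "2017-02-27", "2017-02-28", "2017-02-28",
               "2017-03-01", "2017-03-01"],
        "receipts_qty": [95.0, 101.0, 105.0, 112.0, 118.0, 135.0, 135.0,
                         124.0, 290.0, 84.0, 330.0, 124.0, 100.0, 361.0],
    },
    index=[31, 32, 33, 34, 35, 36, 37, 38, 40, 39, 42, 41, 43, 44],
)


def drop_dt_duplicates_preferring(frame, preferred_sale_id=196):
    """Keep one row per dt, preferring rows whose sale_id matches.

    Hypothetical helper for illustration; it returns a new DataFrame
    and leaves the input untouched.
    """
    return (
        # Temporary boolean flag: True for the rows we would rather keep.
        frame.assign(_preferred=frame["sale_id"].eq(preferred_sale_id))
        # Preferred rows first, so drop_duplicates (keep='first') picks them.
        .sort_values(["_preferred", "dt"], ascending=[False, True])
        .drop_duplicates("dt")
        .drop(columns="_preferred")
        # Restore the original index order.
        .sort_index()
    )


result = drop_dt_duplicates_preferring(df)
print(result)
```

Note that a date with no sale_id == 196 row still keeps one row (as with index 32 here), because drop_duplicates only removes the later rows of each dt group after sorting.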