I have a DataFrame like this:
sale_id dt receipts_qty
31 196.0 2017-02-19 95.0
32 203.0 2017-02-20 101.0
33 196.0 2017-02-21 105.0
34 196.0 2017-02-22 112.0
35 196.0 2017-02-23 118.0
36 196.0 2017-02-24 135.0
37 196.0 2017-02-25 135.0
38 196.0 2017-02-26 124.0
40 203.0 2017-02-27 290.0
39 196.0 2017-02-27 84.0
42 203.0 2017-02-28 330.0
41 196.0 2017-02-28 124.0
43 196.0 2017-03-01 100.0
44 203.0 2017-03-01 361.0
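I need to drop duplicates by dt, keeping the row where sale_id == 196 whenever a date appears more than once. I have only found drop_duplicates('dt', keep='last') and drop_duplicates('dt', keep='first'), but neither does what I need.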
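For example, on just the rows for the three duplicated dates (a minimal sketch; the name `dup` is mine and the rows are copied from the sample above), neither option keeps sale_id == 196 for every date:

```python
import pandas as pd

# Only the rows for the three duplicated dates, in their original order.
dup = pd.DataFrame(
    {
        "sale_id": [203.0, 196.0, 203.0, 196.0, 196.0, 203.0],
        "dt": ["2017-02-27", "2017-02-27", "2017-02-28",
               "2017-02-28", "2017-03-01", "2017-03-01"],
        "receipts_qty": [290.0, 84.0, 330.0, 124.0, 100.0, 361.0],
    },
    index=[40, 39, 42, 41, 43, 44],
)

first = dup.drop_duplicates("dt", keep="first")
last = dup.drop_duplicates("dt", keep="last")

print(first["sale_id"].tolist())  # [203.0, 203.0, 196.0]
print(last["sale_id"].tolist())   # [196.0, 196.0, 203.0]
```

Which row survives depends only on row order, not on the value of sale_id.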
I would like to get this DataFrame:
sale_id dt receipts_qty
31 196.0 2017-02-19 95.0
32 203.0 2017-02-20 101.0
33 196.0 2017-02-21 105.0
34 196.0 2017-02-22 112.0
35 196.0 2017-02-23 118.0
36 196.0 2017-02-24 135.0
37 196.0 2017-02-25 135.0
38 196.0 2017-02-26 124.0
39 196.0 2017-02-27 84.0
41 196.0 2017-02-28 124.0
43 196.0 2017-03-01 100.0
Answer 0 (score: 0)
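First create a helper column from the condition, then use sort_values so that the sale_id == 196 rows come first, and drop_duplicates to keep the first row for each dt.
As a final cleanup, drop the helper column a and call sort_index to restore the index order: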
print (df)
sale_id dt receipts_qty
31 196.0 2017-02-19 95.0
32 203.0 2017-02-20 101.0
33 196.0 2017-02-21 105.0
34 196.0 2017-02-22 112.0
35 196.0 2017-02-23 118.0
36 196.0 2017-02-24 135.0
37 196.0 2017-02-25 135.0
38 196.0 2017-02-26 124.0
40 203.0 2017-02-27 290.0
39 196.0 2017-02-27 84.0
42 103.0 2017-02-28 330.0 <-changed data, value < 196
41 196.0 2017-02-28 124.0
43 196.0 2017-03-01 100.0
44 203.0 2017-03-01 361.0
# helper column: 1 where sale_id == 196, otherwise 0
df['a'] = (df.sale_id == 196).astype(int)
# sort so the sale_id == 196 rows come first, drop duplicates by dt (keeps the first),
# then remove the helper column and restore the index order
df = (df.sort_values(['a','dt'], ascending=[False, True])
        .drop_duplicates('dt')
        .drop('a', axis=1)
        .sort_index())
print (df)
sale_id dt receipts_qty
31 196.0 2017-02-19 95.0
32 203.0 2017-02-20 101.0
33 196.0 2017-02-21 105.0
34 196.0 2017-02-22 112.0
35 196.0 2017-02-23 118.0
36 196.0 2017-02-24 135.0
37 196.0 2017-02-25 135.0
38 196.0 2017-02-26 124.0
39 196.0 2017-02-27 84.0
41 196.0 2017-02-28 124.0
43 196.0 2017-03-01 100.0
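For a copy-paste check, here is a self-contained sketch of the same idea on the sample from the question. It is a variation rather than the exact code above: it uses assign so the original df is not modified, and a stable sort on the helper column alone. The column name `_is_preferred` and the variable `result` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Sample data copied from the question, including the original index.
df = pd.DataFrame(
    {
        "sale_id": [196.0, 203.0, 196.0, 196.0, 196.0, 196.0, 196.0,
                    196.0, 203.0, 196.0, 203.0, 196.0, 196.0, 203.0],
        "dt": pd.to_datetime(
            ["2017-02-19", "2017-02-20", "2017-02-21", "2017-02-22",
             "2017-02-23", "2017-02-24", "2017-02-25", "2017-02-26",
             "2017-02-27", "2017-02-27", "2017-02-28", "2017-02-28",
             "2017-03-01", "2017-03-01"]
        ),
        "receipts_qty": [95.0, 101.0, 105.0, 112.0, 118.0, 135.0, 135.0,
                         124.0, 290.0, 84.0, 330.0, 124.0, 100.0, 361.0],
    },
    index=[31, 32, 33, 34, 35, 36, 37, 38, 40, 39, 42, 41, 43, 44],
)

result = (
    # Helper flag: True for the rows we prefer to keep.
    df.assign(_is_preferred=df["sale_id"].eq(196))
      # Stable sort moves preferred rows to the top, keeping relative order.
      .sort_values("_is_preferred", ascending=False, kind="mergesort")
      # The first row per dt is now the sale_id == 196 row when one exists.
      .drop_duplicates("dt")
      .drop(columns="_is_preferred")
      .sort_index()
)

print(result)
```

A date that has no sale_id == 196 row (such as 2017-02-20 here) keeps its only row, which matches the expected output in the question.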