Question

I'm having a problem filtering out duplicate rows that share the same key (ticker), using conditions on the lowest values of other columns (ints and dates). The initial dataset looks like this:

    ticker    dim     cal_date   date0        date1    diff
0   A         ART      9/30/16  12/20/16    12/20/17    -81
1   AA        ART      9/30/16   12/1/16     12/1/17    -62
2   AA        ART      9/30/16   12/1/16      2/8/18   -131
3   AA        ART      9/30/16    2/8/17     12/1/17    -62
4   AA        ART      9/30/16    2/8/17      2/8/18   -131
5   AABA      ART      9/30/16   11/9/16     11/9/17    -40
6   AAC       ART      9/30/16   11/8/16     11/8/17    -39
7   AAL       ART      9/30/16  10/20/16    10/20/17    -20
8   AAMC      ART      9/30/16   11/7/16     11/7/17    -38
9   AAME      ART      9/30/16  11/14/16    11/14/17    -45
36  ABMT      ART      9/30/16   2/14/17     2/14/18    -137
37  ABMT      ART      9/30/16   2/14/17     2/16/18    -139
38  ABMT      ART      9/30/16   2/16/17     2/14/18    -137

Notice that AA appears 4 times and ABMT appears 3 times. I would like to filter out some of these rows based on two conditions. The first keeps, for each ticker, only the rows with the earliest date0, so the dataset now looks like this:

    ticker    dim     cal_date   date0        date1    diff
0   A         ART      9/30/16   12/20/16   12/20/17    -81
1   AA        ART      9/30/16    12/1/16    12/1/17    -62
2   AA        ART      9/30/16    12/1/16     2/8/18   -131
5   AABA      ART      9/30/16    11/9/16    11/9/17    -40
6   AAC       ART      9/30/16    11/8/16    11/8/17    -39
7   AAL       ART      9/30/16   10/20/16   10/20/17    -20
8   AAMC      ART      9/30/16    11/7/16    11/7/17    -38
9   AAME      ART      9/30/16   11/14/16   11/14/17    -45
36  ABMT      ART      9/30/16    2/14/17    2/14/18    -137
37  ABMT      ART      9/30/16    2/14/17    2/16/18    -139

The second condition, applied to the tickers that are still duplicated, removes the rows with the lowest diff value, keeping the row with the highest diff. The final filtered dataset looks like this:

    ticker    dim     cal_date   date0        date1    diff
0   A         ART      9/30/16   12/20/16   12/20/17    -81
1   AA        ART      9/30/16    12/1/16    12/1/17    -62
5   AABA      ART      9/30/16    11/9/16    11/9/17    -40
6   AAC       ART      9/30/16    11/8/16    11/8/17    -39
7   AAL       ART      9/30/16   10/20/16   10/20/17    -20
8   AAMC      ART      9/30/16    11/7/16    11/7/17    -38
9   AAME      ART      9/30/16   11/14/16   11/14/17    -45
36  ABMT      ART      9/30/16    2/14/17    2/14/18    -137

Thank you for your help.

EDIT:

After Wen's answer, I've updated my code to the following:

import pandas as pd
data = pd.read_csv('input_transform.csv')
print(data)

returns:

    Unnamed: 0 ticker  dim cal_date     date0     date1  diff
 0           0      A  ART  9/30/16  12/20/16  12/20/17   -81
 1           1     AA  ART  9/30/16   12/1/16   12/1/17   -62
 2           2     AA  ART  9/30/16   12/1/16    2/8/18  -131
 3           3     AA  ART  9/30/16    2/8/17   12/1/17   -62
 4           4     AA  ART  9/30/16    2/8/17    2/8/18  -131
 5           5   AABA  ART  9/30/16   11/9/16   11/9/17   -40
 6           6    AAC  ART  9/30/16   11/8/16   11/8/17   -39
 7           7    AAL  ART  9/30/16  10/20/16  10/20/17   -20
 8           8   AAMC  ART  9/30/16   11/7/16   11/7/17   -38
 9           9   AAME  ART  9/30/16  11/14/16  11/14/17   -45
10          36   ABMT  ART  9/30/16   2/14/17   2/14/18  -137
11          37   ABMT  ART  9/30/16   2/14/17   2/16/18  -139
12          38   ABMT  ART  9/30/16   2/16/17   2/14/18  -137

Then I add:

# making sure the date is in date format.
data['date0'] = pd.to_datetime(data['date0'].replace("'", ""))
# making sure the diff is in float or int format
data['diff'] = data['diff'].astype(float)

data.sort_values(['date0', 'diff'], ascending=[False, True]).drop_duplicates('ticker', keep='last').sort_index()
print(data)

Which returns:

    Unnamed: 0 ticker  dim cal_date      date0     date1   diff
 0           0      A  ART  9/30/16 2016-12-20  12/20/17  -81.0
 1           1     AA  ART  9/30/16 2016-12-01   12/1/17  -62.0
 2           2     AA  ART  9/30/16 2016-12-01    2/8/18 -131.0
 3           3     AA  ART  9/30/16 2017-02-08   12/1/17  -62.0
 4           4     AA  ART  9/30/16 2017-02-08    2/8/18 -131.0
 5           5   AABA  ART  9/30/16 2016-11-09   11/9/17  -40.0
 6           6    AAC  ART  9/30/16 2016-11-08   11/8/17  -39.0
 7           7    AAL  ART  9/30/16 2016-10-20  10/20/17  -20.0
 8           8   AAMC  ART  9/30/16 2016-11-07   11/7/17  -38.0
 9           9   AAME  ART  9/30/16 2016-11-14  11/14/17  -45.0
10          36   ABMT  ART  9/30/16 2017-02-14   2/14/18 -137.0
11          37   ABMT  ART  9/30/16 2017-02-14   2/16/18 -139.0
12          38   ABMT  ART  9/30/16 2017-02-16   2/14/18 -137.0

So unfortunately, so far, no luck.

Answer 1

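Use sort_values + drop_duplicates. Sorting by date0 descending and diff ascending puts the row with the earliest date0 and the highest diff last within each ticker, so keep='last' retains exactly that row: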

df.sort_values(['date0','diff'],ascending=[False,True]).drop_duplicates('ticker',keep='last').sort_index()
Out[1071]: 
   ticker  dim cal_date     date0     date1  diff
0       A  ART  9/30/16  12/20/16  12/20/17   -81
1      AA  ART  9/30/16   12/1/16   12/1/17   -62
5    AABA  ART  9/30/16   11/9/16   11/9/17   -40
6     AAC  ART  9/30/16   11/8/16   11/8/17   -39
7     AAL  ART  9/30/16  10/20/16  10/20/17   -20
8    AAMC  ART  9/30/16   11/7/16   11/7/17   -38
9    AAME  ART  9/30/16  11/14/16  11/14/17   -45
36   ABMT  ART  9/30/16   2/14/17   2/14/18  -137
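
Regarding the edit in the question: sort_values and drop_duplicates each return a new DataFrame and leave the original untouched, so calling the chain without assigning the result and then printing data shows the unfiltered rows. Below is a minimal, self-contained sketch of the same approach with the result assigned. The DataFrame is rebuilt by hand from the sample rows in the question instead of being read from input_transform.csv, the name result is only illustrative, and the date format string is an assumption based on how the sample dates look.

```python
import pandas as pd

# Sample rows copied from the question, with the original index labels kept.
df = pd.DataFrame(
    {
        "ticker": ["A", "AA", "AA", "AA", "AA", "AABA", "AAC",
                   "AAL", "AAMC", "AAME", "ABMT", "ABMT", "ABMT"],
        "dim": ["ART"] * 13,
        "cal_date": ["9/30/16"] * 13,
        "date0": ["12/20/16", "12/1/16", "12/1/16", "2/8/17", "2/8/17",
                  "11/9/16", "11/8/16", "10/20/16", "11/7/16", "11/14/16",
                  "2/14/17", "2/14/17", "2/16/17"],
        "date1": ["12/20/17", "12/1/17", "2/8/18", "12/1/17", "2/8/18",
                  "11/9/17", "11/8/17", "10/20/17", "11/7/17", "11/14/17",
                  "2/14/18", "2/16/18", "2/14/18"],
        "diff": [-81, -62, -131, -62, -131, -40, -39,
                 -20, -38, -45, -137, -139, -137],
    },
    index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 36, 37, 38],
)

# Parse date0 so the sort is chronological rather than a string comparison.
# The month/day/two-digit-year format is assumed from the sample data.
df["date0"] = pd.to_datetime(df["date0"], format="%m/%d/%y")

# Assign the result: the chained calls do not modify df in place.
result = (
    df.sort_values(["date0", "diff"], ascending=[False, True])
      .drop_duplicates("ticker", keep="last")
      .sort_index()
)

print(result)
```

Here df still holds all 13 rows after the chain runs, while result holds one row per ticker. In the question's code the equivalent change would be to write data = data.sort_values(...).drop_duplicates(...).sort_index() before printing.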
