I'm having a problem filtering out duplicate rows that share the same key (ticker), using conditions on other columns: the earliest date and the lowest value (int & dates).
So, the initial dataset looks like the following:
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
2 AA ART 9/30/16 12/1/16 2/8/18 -131
3 AA ART 9/30/16 2/8/17 12/1/17 -62
4 AA ART 9/30/16 2/8/17 2/8/18 -131
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
37 ABMT ART 9/30/16 2/14/17 2/16/18 -139
38 ABMT ART 9/30/16 2/16/17 2/14/18 -137
Notice that the ticker AA is repeated 4 times and the ticker ABMT is repeated 3 times. I would like to filter out some of these rows based on two conditions. The first keeps, for each ticker, only the rows with the earliest date0, so the dataset will now look like this:
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
2 AA ART 9/30/16 12/1/16 2/8/18 -131
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
37 ABMT ART 9/30/16 2/14/17 2/16/18 -139
The second condition removes, for each ticker, the rows with the lower diff values, so that only the row with the highest diff remains. The final filtered dataset will look like this:
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
Thank you for your help.
EDIT:
After Wen's answer, I've updated my code to the following:
import pandas as pd
data = pd.read_csv('input_transform.csv')
print(data)
returns:
Unnamed: 0 ticker dim cal_date date0 date1 diff
0 0 A ART 9/30/16 12/20/16 12/20/17 -81
1 1 AA ART 9/30/16 12/1/16 12/1/17 -62
2 2 AA ART 9/30/16 12/1/16 2/8/18 -131
3 3 AA ART 9/30/16 2/8/17 12/1/17 -62
4 4 AA ART 9/30/16 2/8/17 2/8/18 -131
5 5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 9 AAME ART 9/30/16 11/14/16 11/14/17 -45
10 36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
11 37 ABMT ART 9/30/16 2/14/17 2/16/18 -139
12 38 ABMT ART 9/30/16 2/16/17 2/14/18 -137
Then I add:
# making sure the date is in date format.
data['date0'] = pd.to_datetime(data['date0'].replace("'", ""))
# making sure the diff is in float or int format
data['diff'] = data['diff'].astype(float)
data.sort_values(['date0', 'diff'], ascending=[False, True]).drop_duplicates('ticker', keep='last').sort_index()
print(data)
Which returns:
Unnamed: 0 ticker dim cal_date date0 date1 diff
0 0 A ART 9/30/16 2016-12-20 12/20/17 -81.0
1 1 AA ART 9/30/16 2016-12-01 12/1/17 -62.0
2 2 AA ART 9/30/16 2016-12-01 2/8/18 -131.0
3 3 AA ART 9/30/16 2017-02-08 12/1/17 -62.0
4 4 AA ART 9/30/16 2017-02-08 2/8/18 -131.0
5 5 AABA ART 9/30/16 2016-11-09 11/9/17 -40.0
6 6 AAC ART 9/30/16 2016-11-08 11/8/17 -39.0
7 7 AAL ART 9/30/16 2016-10-20 10/20/17 -20.0
8 8 AAMC ART 9/30/16 2016-11-07 11/7/17 -38.0
9 9 AAME ART 9/30/16 2016-11-14 11/14/17 -45.0
10 36 ABMT ART 9/30/16 2017-02-14 2/14/18 -137.0
11 37 ABMT ART 9/30/16 2017-02-14 2/16/18 -139.0
12 38 ABMT ART 9/30/16 2017-02-16 2/14/18 -137.0
Unfortunately, the duplicate rows are still there, so no luck so far.
Answer 0 (score: 3)
You can use sort_values + drop_duplicates:
df.sort_values(['date0','diff'],ascending=[False,True]).drop_duplicates('ticker',keep='last').sort_index()
Out[1071]:
ticker dim cal_date date0 date1 diff
0 A ART 9/30/16 12/20/16 12/20/17 -81
1 AA ART 9/30/16 12/1/16 12/1/17 -62
5 AABA ART 9/30/16 11/9/16 11/9/17 -40
6 AAC ART 9/30/16 11/8/16 11/8/17 -39
7 AAL ART 9/30/16 10/20/16 10/20/17 -20
8 AAMC ART 9/30/16 11/7/16 11/7/17 -38
9 AAME ART 9/30/16 11/14/16 11/14/17 -45
36 ABMT ART 9/30/16 2/14/17 2/14/18 -137
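One likely reason the updated code in the question still prints every row: sort_values, drop_duplicates and sort_index each return a new DataFrame rather than modifying data in place, so the result has to be assigned before printing. A minimal sketch of that, assuming the sample rows from the question (the frame is rebuilt by hand here instead of being read from input_transform.csv, and the name result is only illustrative):

```python
import pandas as pd

# Sample rows copied from the question: (index label, ticker, dim,
# cal_date, date0, date1, diff).
rows = [
    (0, "A", "ART", "9/30/16", "12/20/16", "12/20/17", -81),
    (1, "AA", "ART", "9/30/16", "12/1/16", "12/1/17", -62),
    (2, "AA", "ART", "9/30/16", "12/1/16", "2/8/18", -131),
    (3, "AA", "ART", "9/30/16", "2/8/17", "12/1/17", -62),
    (4, "AA", "ART", "9/30/16", "2/8/17", "2/8/18", -131),
    (5, "AABA", "ART", "9/30/16", "11/9/16", "11/9/17", -40),
    (6, "AAC", "ART", "9/30/16", "11/8/16", "11/8/17", -39),
    (7, "AAL", "ART", "9/30/16", "10/20/16", "10/20/17", -20),
    (8, "AAMC", "ART", "9/30/16", "11/7/16", "11/7/17", -38),
    (9, "AAME", "ART", "9/30/16", "11/14/16", "11/14/17", -45),
    (36, "ABMT", "ART", "9/30/16", "2/14/17", "2/14/18", -137),
    (37, "ABMT", "ART", "9/30/16", "2/14/17", "2/16/18", -139),
    (38, "ABMT", "ART", "9/30/16", "2/16/17", "2/14/18", -137),
]
cols = ["ticker", "dim", "cal_date", "date0", "date1", "diff"]
data = pd.DataFrame(
    [r[1:] for r in rows], columns=cols, index=[r[0] for r in rows]
)

# Parse date0 so it sorts chronologically rather than as text.
data["date0"] = pd.to_datetime(data["date0"], format="%m/%d/%y")

# Same chain as above, but the returned frame is kept in a variable.
# Sorting date0 descending and diff ascending puts, for each ticker,
# the earliest date0 with the highest diff last, which keep='last' retains.
result = (
    data.sort_values(["date0", "diff"], ascending=[False, True])
        .drop_duplicates("ticker", keep="last")
        .sort_index()
)
print(result)
```

With the sample rows this should leave one row per ticker, with index labels 0, 1, 5, 6, 7, 8, 9 and 36. If you prefer to keep the variable name, data = data.sort_values(...)... works the same way.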