Sample DF:
ID Name Price Date Fit_Test
1 Apple 10 2018-01-15 Super_Fit
2 Apple 10 2018-01-15 Super_Fit
3 Apple 10 2019-01-15 Super_Fit
4 Orange 12 2019-02-15 Not_Fit
5 Orange 12 2018-09-25 Not_Fit
6 Orange 12 NaT Not_Fit
7 Orange 12 2028-01-25 Not_Fit
8 Banana 15 2019-12-25 Medium_Fit
9 Banana 15 NaT Medium_Fit
10 Cherry 30 2021-06-23 Super_Fit
11 PineAp 30 2023-02-03 Medium_Fit
12 PineAp 30 2020-12-13 Medium_Fit
Expected DF:
ID Name Price Date Fit_Test
1 Apple 10 2018-01-15 Super_Fit
2 Apple 10 2018-01-15 Super_Fit
3 Apple 10 2019-01-15 Super_Fit
7 Orange 12 2028-01-25 Not_Fit
8 Banana 15 2019-12-25 Medium_Fit
9 Banana 15 NaT Medium_Fit
10 Cherry 30 2021-06-23 Super_Fit
11 PineAp 30 2023-02-03 Medium_Fit
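Problem statement:
I want to group by Name and Price, then filter on Date, using Fit_Test as the condition column.
1. If Fit_Test is Super_Fit, no operation is needed. (Rows 1, 2, 3 and 10 are the same in the input and the expected DF.)
2. If, within a Name and Price group, Fit_Test is not Super_Fit and the group contains no NaT, compare the dates and keep the row with the latest date. (IDs 11 and 12: ID 12 is removed in the expected DF.)
3. If, within a Name and Price group, Fit_Test is not Super_Fit and the group contains a NaT:
3.1 If the group has more than 2 rows, compare the dates and keep the row with the latest date. (IDs 4, 5, 6, 7: IDs 4, 5, 6 are removed in the expected DF.)
3.2 If the group has exactly 2 rows, keep both rows. (IDs 8, 9: both are kept in the expected DF.)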
Answer 0: (score: 2)
Use:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# m1: every row in the (Name, Price) group is Super_Fit
m1 = df['Fit_Test'].eq('Super_Fit').groupby([df['Name'],df['Price']]).transform('all')
# m2: the group has no NaT in Date
m2 = df['Date'].notna().groupby([df['Name'],df['Price']]).transform('all')
# m22: the row carries the latest Date of its group
m22 = df['Date'].eq(df.groupby(['Name', 'Price'])['Date'].transform('max'))
# m3: the group has exactly 2 rows
m3 = df.groupby(['Name', 'Price'])['Date'].transform('size').eq(2)
df = df[m1 | (m2 & m22) | (~m2 & m3) | (~m2 & m22)]
# the conditions can apparently be simplified to:
#df = df[m1 | m22 | (~m2 & m3)]
print(df)
ID Name Price Date Fit_Test
0 1 Apple 10 2018-01-15 Super_Fit
1 2 Apple 10 2018-01-15 Super_Fit
2 3 Apple 10 2019-01-15 Super_Fit
6 7 Orange 12 2028-01-25 Not_Fit
7 8 Banana 15 2019-12-25 Medium_Fit
8 9 Banana 15 NaT Medium_Fit
9 10 Cherry 30 2021-06-23 Super_Fit
10 11 PineAp 30 2023-02-03 Medium_Fit
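For anyone who wants to reproduce the result, here is a minimal self-contained sketch that rebuilds the sample DF from the question and applies the simplified mask mentioned in the comment above. The variable names (keys, all_super, has_nat, is_latest, is_pair, kept_ids) are my own and not part of the answer, and the sketch assumes that Fit_Test is constant within each Name and Price group, as it is in the sample data.

```python
import pandas as pd

# Sample data copied from the question; None stands in for NaT.
df = pd.DataFrame({
    "ID": list(range(1, 13)),
    "Name": ["Apple"] * 3 + ["Orange"] * 4 + ["Banana"] * 2
            + ["Cherry"] + ["PineAp"] * 2,
    "Price": [10] * 3 + [12] * 4 + [15] * 2 + [30] * 3,
    "Date": ["2018-01-15", "2018-01-15", "2019-01-15", "2019-02-15",
             "2018-09-25", None, "2028-01-25", "2019-12-25", None,
             "2021-06-23", "2023-02-03", "2020-12-13"],
    "Fit_Test": ["Super_Fit"] * 3 + ["Not_Fit"] * 4 + ["Medium_Fit"] * 2
                + ["Super_Fit"] + ["Medium_Fit"] * 2,
})
df["Date"] = pd.to_datetime(df["Date"])

keys = [df["Name"], df["Price"]]

# Rule 1: groups made up entirely of Super_Fit rows are kept untouched.
all_super = df["Fit_Test"].eq("Super_Fit").groupby(keys).transform("all")
# True for every row of a group that contains at least one NaT.
has_nat = df["Date"].isna().groupby(keys).transform("any")
# Rules 2 and 3.1: the row holds the latest date of its group (max skips NaT).
is_latest = df["Date"].eq(df.groupby(["Name", "Price"])["Date"].transform("max"))
# Rule 3.2: the group has exactly two rows.
is_pair = df.groupby(["Name", "Price"])["Date"].transform("size").eq(2)

result = df[all_super | is_latest | (has_nat & is_pair)]
kept_ids = result["ID"].tolist()
print(kept_ids)
```

One caveat: if several rows in a group share the latest date, is_latest keeps all of them, so a tie-breaking rule would be needed if exactly one row per group is required.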