            A         B    C
0  2002-01-13  15:00:00  120
1  2002-01-13  15:30:00  110
2  2002-01-13  16:00:00  130
3  2002-01-13  16:30:00  140
4  2002-01-14  15:00:00  180
5  2002-01-14  15:30:00  165
6  2002-01-14  16:00:00  150
7  2002-01-14  16:30:00  170
I want to select one row for each group in A that satisfies the following conditions:
The output should be:
            A         B    C
0  2002-01-13  15:00:00  120
5  2002-01-14  15:30:00  165
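To make the example reproducible, the sample frame can be rebuilt roughly as follows. This is only a sketch: reading A as the date and B as the time is an assumption taken from the printed table, and both are kept as plain strings here.

```python
import pandas as pd

# Assumption: column A is the date and column B the time, as the printout suggests.
df = pd.DataFrame({
    "A": ["2002-01-13"] * 4 + ["2002-01-14"] * 4,
    "B": ["15:00:00", "15:30:00", "16:00:00", "16:30:00"] * 2,
    "C": [120, 110, 130, 140, 180, 165, 150, 170],
})

# The rows the question wants to end up with (index labels 0 and 5).
expected = df.loc[[0, 5]]
print(expected)
```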
Answer 0 (score: 2)
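As @Anton vBR commented, first filter out rows by the condition within each group, then get the row with the minimal C per group using idxmin, and select it with loc: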
df = df[df.groupby('A')['C'].transform(lambda x: x >= x.min() + 10)]
# alternative: filter with transform('min') only, no lambda needed
#df = df[df.groupby('A')['C'].transform('min') + 10 <= df['C']]
print (df)
            A         B    C
0  2002-01-13  15:00:00  120
2  2002-01-13  16:00:00  130
3  2002-01-13  16:30:00  140
4  2002-01-14  15:00:00  180
5  2002-01-14  15:30:00  165
7  2002-01-14  16:30:00  170
df = df.loc[df.groupby('A')['C'].idxmin()]
This gives the same result as the following, applied to the original (unfiltered) df:
idx=df.sort_values(['A','C']).groupby('A')['C'].apply(lambda x: (x >= x.min() + 10).idxmax())
df = df.loc[idx]
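The idxmax call works here because it is applied to a boolean Series: after sorting by C, the first True marks the smallest value that clears the threshold. A minimal sketch of that step, assuming the sample data from the question (A as date, B as time) and using a hypothetical helper name first_qualifying_label:

```python
import pandas as pd

# Sample data from the question; A as date and B as time is an assumption.
df = pd.DataFrame({
    "A": ["2002-01-13"] * 4 + ["2002-01-14"] * 4,
    "B": ["15:00:00", "15:30:00", "16:00:00", "16:30:00"] * 2,
    "C": [120, 110, 130, 140, 180, 165, 150, 170],
})

def first_qualifying_label(c):
    # c holds one group's C values, already sorted ascending.
    mask = c >= c.min() + 10
    # idxmax on a boolean Series returns the label of the first True.
    return mask.idxmax()

idx = df.sort_values(["A", "C"]).groupby("A")["C"].apply(first_qualifying_label)
result = df.loc[idx]
print(result)
```

One caveat: if no row in a group clears the threshold (for example a group with a single row), the mask is all False and idxmax still returns the first label, so that group would need separate handling.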
An alternative solution with sort_values and drop_duplicates:
df = df.sort_values(['A','C'])
df = df[df.groupby('A')['C'].transform(lambda x: x >= x.min() + 10)].drop_duplicates(['A'])
print (df)
            A         B    C
0  2002-01-13  15:00:00  120
5  2002-01-14  15:30:00  165
Answer 1 (score: 1)
Here is a vectorized solution. Sometimes a helper column is more efficient than an inline lambda-based solution.
df['Floor'] = df['C'] - (df.groupby('A')['C'].transform('min') + 10)
res = df.loc[df[df['Floor'] >= 0].groupby('A')['Floor'].idxmin()]
Result:
            A         B    C  Floor
0  2002-01-13  15:00:00  120      0
5  2002-01-14  15:30:00  165      5
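If the extra Floor column is not wanted in the result, the same vectorized idea can be sketched with a temporary Series instead of a stored column. The names threshold and candidates are illustrative, and the data layout (A as date, B as time) is again an assumption.

```python
import pandas as pd

# Sample data from the question; A as date and B as time is an assumption.
df = pd.DataFrame({
    "A": ["2002-01-13"] * 4 + ["2002-01-14"] * 4,
    "B": ["15:00:00", "15:30:00", "16:00:00", "16:30:00"] * 2,
    "C": [120, 110, 130, 140, 180, 165, 150, 170],
})

# Per-row threshold: the group's minimum C plus 10, aligned to df's index.
threshold = df.groupby("A")["C"].transform("min") + 10

# Keep rows at or above the threshold, then take the smallest C per group.
candidates = df[df["C"] >= threshold]
res = df.loc[candidates.groupby("A")["C"].idxmin()]
print(res)
```

This leaves df unchanged, which may matter if the frame is reused later.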