Question

For a pandas DataFrame with groups, I want to keep all rows until the first occurrence of a specific value (and discard all other rows).

MWE:

Given

import pandas as pd
df = pd.DataFrame({'A' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'tmp'],
                   'B' : [0, 1, 0, 0, 0, 1, 0],
                   'C' : [2.0, 5., 8., 1., 2., 9., 7.]})

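which gives the DataFrame

     A  B    C
0  foo  0  2.0
1  foo  1  5.0
2  foo  0  8.0
3  bar  0  1.0
4  bar  0  2.0
5  bar  1  9.0
6  tmp  0  7.0

I want to keep all rows of each group (A is the grouping variable) until B == 1, including that row. So my desired output is

    A    B  C
0   foo  0  2.0
1   foo  1  5.0
3   bar  0  1.0
4   bar  0  2.0
5   bar  1  9.0
6   tmp  0  7.0

How can I subset all rows of a grouped DataFrame that meet a certain criterion?

I found how to drop specific groups not meeting a certain criteria (and keeping all other rows of all other groups), but not how to drop specific rows from all groups. The furthest I got is obtaining, for each group, the position of the last row I want to keep:

df.groupby('A').apply(lambda x: x['B'].cumsum().searchsorted(1))

which results in

A
bar    2
foo    1
tmp    1

This is not sufficient, because it does not return the actual data (and it might be better if the result for tmp were different).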

Answer 1

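After reading this question about the difference between groupby.apply and groupby.aggregate, I realized that apply operates on all columns and rows of each group (so on a DataFrame?). So here is my function, which should be applied to each group: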

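def f(group):
    # Position of the first row with B == 1 (the running sum of B first reaches 1 there);
    # this equals len(group) when the group has no such row.
    index = min(group['B'].cumsum().searchsorted(1), len(group))
    # Keep everything up to and including that position.
    return group.iloc[0:index+1]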

Running df.groupby('A').apply(f) gives the expected result:

            A       B   C
A               
bar     3   bar     0   1.0
        4   bar     0   2.0
        5   bar     1   9.0
foo     0   foo     0   2.0
        1   foo     1   5.0
tmp     6   tmp     0   7.0
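
The same selection can probably also be written without apply, as a boolean mask over the original frame. The following is only a sketch of that alternative, not the approach used in the answer above; the names hit, prior_hits and result are made up for illustration, and it assumes the rows of each group are already in the order you want to scan them.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'tmp'],
                   'B': [0, 1, 0, 0, 0, 1, 0],
                   'C': [2.0, 5., 8., 1., 2., 9., 7.]})

# 1 where the stop value occurs, 0 elsewhere (illustrative name)
hit = df['B'].eq(1).astype(int)

# Number of hits seen strictly before each row, counted within its group:
# the per-group running total minus the current row's own contribution.
prior_hits = hit.groupby(df['A']).cumsum() - hit

# Keep a row only if no hit occurred earlier in its group, so the first
# row with B == 1 is kept and everything after it is dropped.
result = df[prior_hits.eq(0)]
print(result)
```

Because this filters df directly, the result keeps the original index and row order, rather than the (A, original index) MultiIndex that apply produces.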
