Question

我想在Sales列上执行过滤，这样对于任何Make-Auction组，至少应该有一个销售额> = 100.因此对于Acura，Copart有101个销售额，因此Acura的两个行都将在输出中出现。对于宝马来说，Copart和IAA的销售额均<100，因此将被过滤掉。

数据帧：

Make    Auction Sales   
Acura   Copart  101
Acura   IAA     88  
BMW     Copart  50  
BMW     IAA     60
Buick   Copart  130 
Buick   IAA     140

预期产出：

Make    Auction Sales   
Acura   Copart  101
Acura   IAA         88  
Buick   Copart  130 
Buick   IAA     140

我可以在整个Sales列上应用＆gt;过滤器，但这不是我想要的。有关如何执行此操作的任何建议？谢谢！

Answer 1

过滤销售＆gt; = 100的记录的数据框，然后获取唯一的Make个汽车。最后，如果任何make在此过滤集中，则使用布尔索引。

>>> df[df['Make'].isin(df[df['Sales'] >= 100]['Make'].unique())]
    Make Auction  Sales
0  Acura  Copart    101
1  Acura     IAA     88
4  Buick  Copart    130
5  Buick     IAA    140

Answer 2

使用filtration：

df = df.groupby('Make').filter(lambda x: x['Sales'].ge(100).any())
print (df)
    Make Auction  Sales
0  Acura  Copart    101
1  Acura     IAA     88
4  Buick  Copart    130
5  Buick     IAA    140

另一个loc和boolean indexing isin的Make值的解决方案，按{{3}}过滤：

print (df.loc[df['Sales'] >= 100, 'Make'])
0    Acura
4    Buick
5    Buick
Name: Make, dtype: object

print (df['Make'].isin(df.loc[df['Sales'] >= 100, 'Make']))
0     True
1     True
2    False
3    False
4     True
5     True
Name: Make, dtype: bool

df = df[df['Make'].isin(df.loc[df['Sales'] >= 100, 'Make'])]
print (df)
    Make Auction  Sales
0  Acura  Copart    101
1  Acura     IAA     88
4  Buick  Copart    130
5  Buick     IAA    140

第二种解决方案更快：

np.random.seed(123)
N = 1000000
L = list('abcdefghijklmno')
df = pd.DataFrame({'Make': np.random.choice(L, N),
                   'Sales':np.random.randint(110, size=N)})
print (df)

In [59]: %timeit df[df['Make'].isin(df.loc[df['Sales'] >= 100, 'Make'])]
10 loops, best of 3: 55.6 ms per loop

#Alexander answer
In [60]: %timeit df[df['Make'].isin(df[df['Sales'] >= 100]['Make'].unique())]
10 loops, best of 3: 65 ms per loop

In [61]: %timeit df.groupby('Make').filter(lambda x: x['Sales'].ge(100).any())
1 loop, best of 3: 217 ms per loop

#piRSquared solution 1
In [62]: %timeit df[df.Sales.ge(100).groupby([df.Make]).transform('any')]
1 loop, best of 3: 135 ms per loop

#piRSquared solution 2
In [63]: %%timeit
    ...: f, u = pd.factorize(df.Make.values)
    ...: w = df.Sales.values >= 100
    ...: df[(np.bincount(f, w) > 0)[f]]
    ...: 
10 loops, best of 3: 67.2 ms per loop

Answer 3

选项1
transform
如果组中的任何元素为'any'，则使用True返回True，并在组中的所有索引中广播它。

df[df.Sales.ge(100).groupby([df.Make]).transform('any')]

    Make Auction  Sales
0  Acura  Copart    101
1  Acura     IAA     88
4  Buick  Copart    130
5  Buick     IAA    140

选项2
pd.factorize + np.bincount
我们使用np.bincount来增加pd.factorize的二进制数，其中真值由df.Sales.values >= 100确定。如果bin大于0，那么我们应该获取该bin定义的组中的每个元素。我们可以通过f再次切片来获得相应的数组。

这与选项1非常类似。

f, u = pd.factorize(df.Make.values)
w = df.Sales.values >= 100
df[(np.bincount(f, w) > 0)[f]]

    Make Auction  Sales
0  Acura  Copart    101
1  Acura     IAA     88
4  Buick  Copart    130
5  Buick     IAA    140

基于Pandas Python中的分组列值执行条件筛选

3 个答案: