A B C
0 2002-01-12 Sarah 39
1 2002-01-12 John 17
2 2002-01-12 Susan 30
3 2002-01-15 Danny 12
4 2002-01-15 Peter 25
5 2002-01-15 John 25
6 2002-01-20 John 16
7 2002-01-20 Hung 10
8 2002-02-20 John 20
9 2002-02-20 Susan 40
10 2002-02-24 Rebel 40
11 2002-02-24 Susan 15
12 2002-02-24 Mark 38
13 2002-02-24 Susan 30
I want to select the complete groups of A that contain both John and Susan.
The output should be:
A B C
0 2002-01-12 Sarah 39
1 2002-01-12 John 17
2 2002-01-12 Susan 30
6 2002-01-20 John 16
7 2002-01-20 Hung 10
8 2002-02-20 John 20
9 2002-02-20 Susan 40
I tried:
df.groupby('A').apply(lambda x: ((df.B == x.John) & (df.B == x.Susan)))
Answer 0 (score: 2)
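Create an array of dates as the intersection of the dates that contain John and the dates that contain Susan:
dates = np.intersect1d(
    df.A.values[df.B.values == 'John'],
    df.A.values[df.B.values == 'Susan']
)
Then use the array of dates to filter the dataframe: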
df[df.A.isin(dates)]
# outputs:
A B C
0 2002-01-12 Sarah 39
1 2002-01-12 John 17
2 2002-01-12 Susan 30
8 2002-02-20 John 20
9 2002-02-20 Susan 40
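For anyone who wants to run this end to end, here is a minimal self-contained sketch of the intersection approach. The dataframe is rebuilt from the sample in the question; the variable names `john_dates`, `susan_dates` and `result` are illustrative, and `.to_numpy()` is used in place of `.values` as an assumption that it behaves the same here.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["2002-01-12"] * 3 + ["2002-01-15"] * 3 + ["2002-01-20"] * 2
         + ["2002-02-20"] * 2 + ["2002-02-24"] * 4,
    "B": ["Sarah", "John", "Susan", "Danny", "Peter", "John", "John", "Hung",
          "John", "Susan", "Rebel", "Susan", "Mark", "Susan"],
    "C": [39, 17, 30, 12, 25, 25, 16, 10, 20, 40, 40, 15, 38, 30],
})

# Dates on which each name appears (illustrative variable names).
john_dates = df["A"].to_numpy()[df["B"].to_numpy() == "John"]
susan_dates = df["A"].to_numpy()[df["B"].to_numpy() == "Susan"]

# Sorted, unique dates present in both arrays.
dates = np.intersect1d(john_dates, susan_dates)

# Keep every row whose date is in the intersection.
result = df[df["A"].isin(dates)]
print(result)
```

Only 2002-01-12 and 2002-02-20 contain both names, so the 2002-01-20 group (John and Hung) is not selected.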
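The numpy-based solution is several times faster than the other solutions (roughly 4x faster than the set-based one and about 12x faster than the original groupby/transform one on this data).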
In [288]: def hal(df):
     ...:     dates = np.intersect1d(
     ...:         df.A.values[df.B.values == 'John'],
     ...:         df.A.values[df.B.values == 'Susan']
     ...:     )
     ...:     return df[df.A.isin(dates)]
     ...:
In [289]: def jpp(df):
     ...:     s = df.groupby('A')['B'].apply(set)
     ...:     return df[df['A'].map(s) >= {'John', 'Susan'}]
     ...:
In [290]: def alollz(df):
     ...:     flag = df.groupby('A').B.transform(lambda x: ((x=='Susan').any() & (x == 'John').any()).sum().astype('bool'))
     ...:     return df[flag==True]
     ...:
In [291]: %timeit hal(df)
394 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [292]: %timeit jpp(df)
1.46 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [293]: %timeit alollz(df)
4.9 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
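However, the solution proposed by ALollz can be sped up about 2x by dropping some unnecessary operations and doing the comparisons on the underlying numpy arrays.
In [294]: def alollz_improved(df):
     ...:     v = df.groupby('A').B.transform(lambda x: (x.values=='Susan').any() & (x.values=='John').any())
     ...:     return df[v]
     ...: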
In [295]: %timeit alollz_improved(df)
2.2 ms ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Answer 1 (score: 1)
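You can use groupby + transform to create a flag for the groups that satisfy the condition, and then mask the original df with that flag. If you don't want to modify the original df, you can create a separate Series named flag; otherwise you can assign it to a column in the original df.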
import pandas as pd
# As Haleemur Ali points out, use x.values to make it faster
flag = df.groupby('A').B.transform(lambda x: (x.values == 'Susan').any() & (x.values == 'John').any())
Then you can filter df:
df[flag]
# A B C
#0 2002-01-12 Sarah 39
#1 2002-01-12 John 17
#2 2002-01-12 Susan 30
#8 2002-02-20 John 20
#9 2002-02-20 Susan 40
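The two options mentioned above (a separate flag Series versus a column on the frame) can be sketched as follows. This is only an illustration: the helper `has_both` and the column name `keep` are hypothetical names that do not appear in the answer, and the dataframe is rebuilt from the question's sample.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["2002-01-12"] * 3 + ["2002-01-15"] * 3 + ["2002-01-20"] * 2
         + ["2002-02-20"] * 2 + ["2002-02-24"] * 4,
    "B": ["Sarah", "John", "Susan", "Danny", "Peter", "John", "John", "Hung",
          "John", "Susan", "Rebel", "Susan", "Mark", "Susan"],
    "C": [39, 17, 30, 12, 25, 25, 16, 10, 20, 40, 40, 15, 38, 30],
})

def has_both(names):
    # Hypothetical helper: True if the group contains both names.
    values = names.to_numpy()
    return (values == "Susan").any() & (values == "John").any()

# transform broadcasts the per-group scalar back to every row of the group.
flag = df.groupby("A")["B"].transform(has_both).astype(bool)

# Option 1: keep the flag as a separate Series and use it as a mask.
separate = df[flag]

# Option 2: store the flag as a column ("keep" is an assumed name),
# filter on it, then drop it again. assign() returns a new frame,
# so the original df is left untouched.
flagged = df.assign(keep=flag)
filtered = flagged[flagged["keep"]].drop(columns="keep")

print(separate)
```

Both options select the same rows; the column variant is mainly useful if the flag is needed again later.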
Answer 2 (score: 1)
Create a series that maps each date to a set of names. Then use set.issuperset via its syntactic sugar >=:
s = df.groupby('A')['B'].apply(set)
res = df[df['A'].map(s) >= {'John', 'Susan'}]
print(res)
A B C
0 2002-01-12 Sarah 39
1 2002-01-12 John 17
2 2002-01-12 Susan 30
8 2002-02-20 John 20
9 2002-02-20 Susan 40
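To make the superset test explicit, here is a small sketch that applies the comparison one set at a time instead of comparing the whole Series against a set. The names `required`, `names_by_date` and `mask` are illustrative, and the dataframe is rebuilt from the question's sample.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "A": ["2002-01-12"] * 3 + ["2002-01-15"] * 3 + ["2002-01-20"] * 2
         + ["2002-02-20"] * 2 + ["2002-02-24"] * 4,
    "B": ["Sarah", "John", "Susan", "Danny", "Peter", "John", "John", "Hung",
          "John", "Susan", "Rebel", "Susan", "Mark", "Susan"],
    "C": [39, 17, 30, 12, 25, 25, 16, 10, 20, 40, 40, 15, 38, 30],
})

required = {"John", "Susan"}

# For plain Python sets, a >= b is the same test as a.issuperset(b).
assert ({"Sarah", "John", "Susan"} >= required) == {"Sarah", "John", "Susan"}.issuperset(required)

# One set of names per date.
names_by_date = df.groupby("A")["B"].apply(set)

# Look up each row's set of names and test it against the required names.
mask = df["A"].map(names_by_date).map(lambda names: names.issuperset(required))
res = df[mask]
print(res)
```

Changing `required` is enough to test for a different combination of names.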