My data looks like this:
ID# DATE TEXT
1 1/1/2017 ENTERED BY A
1 1/1/2017 BLAH BLAH BLAH
1 1/2/2017 ENTERED BY B
1 1/2/2017 BLAH BLAH BLAH
1 1/2/2017 BLAH BLAH BLAH
2 1/4/2017 SUPPLEMENTAL PAYMENT BY A
2 1/4/2017 BLAH BLAH BLAH
3 1/1/2017 ENTERED BY C
3 1/2/2017 CHANGED COMPANY NAME
3 1/2/2017 BLAH BLAH BLAH
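I'm trying to group the data by ID# and DATE and, whenever any row in a group (here, each ID#/DATE combination) has a text match, return all rows of that group.
This is what I have so far. The code below tries to search the TEXT field of every row for the substring 'ENTERED BY' and return all rows associated with the matching group.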
notes[notes.groupby('ID#','DATE',as_index=False).apply(lambda x: x['TEXT'].str.contains('ENTERED BY'))]
I have also tried variations of group.filter(), with similar results. Can anyone point me in the right direction? My output set should look like this:
ID# DATE TEXT
1 1/1/2017 ENTERED BY A
1 1/1/2017 BLAH BLAH BLAH
1 1/2/2017 ENTERED BY B
1 1/2/2017 BLAH BLAH BLAH
1 1/2/2017 BLAH BLAH BLAH
3 1/1/2017 ENTERED BY C
Thanks!
Answer 0 (score: 3)
You can use groupby + transform with any, then filter by boolean indexing:
df=df[df['TEXT'].str.contains('ENTERED BY').groupby([df['ID#'],df['DATE']]).transform('any')]
print (df)
ID# DATE TEXT
0 1 1/1/2017 ENTERED BY A
1 1 1/1/2017 BLAH BLAH BLAH
2 1 1/2/2017 ENTERED BY B
3 1 1/2/2017 BLAH BLAH BLAH
4 1 1/2/2017 BLAH BLAH BLAH
7 3 1/1/2017 ENTERED BY C
Details:
print (df['TEXT'].str.contains('ENTERED BY'))
0 True
1 False
2 True
3 False
4 False
5 False
6 False
7 True
8 False
9 False
Name: TEXT, dtype: bool
print(df['TEXT'].str.contains('ENTERED BY').groupby([df['ID#'],df['DATE']]).transform('any'))
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 True
8 False
9 False
Name: TEXT, dtype: bool
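For anyone who wants to run the groupby + transform approach end to end, here is a minimal self-contained sketch. The frame name `notes` comes from the question; the intermediate names `has_match` and `group_has_match` are only illustrative, and `regex=False` is an assumption on my part that the search term is a literal substring.

```python
import pandas as pd

# Sample data copied from the question
notes = pd.DataFrame({
    "ID#": [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    "DATE": ["1/1/2017", "1/1/2017", "1/2/2017", "1/2/2017", "1/2/2017",
             "1/4/2017", "1/4/2017", "1/1/2017", "1/2/2017", "1/2/2017"],
    "TEXT": ["ENTERED BY A", "BLAH BLAH BLAH", "ENTERED BY B",
             "BLAH BLAH BLAH", "BLAH BLAH BLAH",
             "SUPPLEMENTAL PAYMENT BY A", "BLAH BLAH BLAH",
             "ENTERED BY C", "CHANGED COMPANY NAME", "BLAH BLAH BLAH"],
})

# Row-level flag: does this row contain the literal substring?
has_match = notes["TEXT"].str.contains("ENTERED BY", regex=False)

# Broadcast "any row in my (ID#, DATE) group matched" back to every row
group_has_match = has_match.groupby([notes["ID#"], notes["DATE"]]).transform("any")

# Boolean indexing keeps the original index labels
result = notes[group_has_match]
print(result)
```

Because the mask has the same index as `notes`, the original row labels are preserved in the result.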
Another, faster solution is to use drop_duplicates and merge to filter on all matching ID# and DATE pairs:
df=df.loc[df['TEXT'].str.contains('ENTERED BY'), ['ID#','DATE']].drop_duplicates().merge(df)
print (df)
ID# DATE TEXT
0 1 1/1/2017 ENTERED BY A
1 1 1/1/2017 BLAH BLAH BLAH
2 1 1/2/2017 ENTERED BY B
3 1 1/2/2017 BLAH BLAH BLAH
4 1 1/2/2017 BLAH BLAH BLAH
5 3 1/1/2017 ENTERED BY C
Details:
print (df.loc[df['TEXT'].str.contains('ENTERED BY'), ['ID#','DATE']].drop_duplicates())
ID# DATE
0 1 1/1/2017
2 1 1/2/2017
7 3 1/1/2017
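The same idea written out in steps, as a rough sketch on the question's sample data. The names `keys` and `result` are illustrative, and passing `on=` and `how="inner"` explicitly is my own choice for readability (a bare `merge(df)` joins on the shared columns by default).

```python
import pandas as pd

# Sample data copied from the question
notes = pd.DataFrame({
    "ID#": [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    "DATE": ["1/1/2017", "1/1/2017", "1/2/2017", "1/2/2017", "1/2/2017",
             "1/4/2017", "1/4/2017", "1/1/2017", "1/2/2017", "1/2/2017"],
    "TEXT": ["ENTERED BY A", "BLAH BLAH BLAH", "ENTERED BY B",
             "BLAH BLAH BLAH", "BLAH BLAH BLAH",
             "SUPPLEMENTAL PAYMENT BY A", "BLAH BLAH BLAH",
             "ENTERED BY C", "CHANGED COMPANY NAME", "BLAH BLAH BLAH"],
})

# Step 1: the distinct (ID#, DATE) pairs that contain at least one match
has_match = notes["TEXT"].str.contains("ENTERED BY", regex=False)
keys = notes.loc[has_match, ["ID#", "DATE"]].drop_duplicates()

# Step 2: an inner join pulls back every row belonging to those pairs
result = keys.merge(notes, on=["ID#", "DATE"], how="inner")
print(result)
```

Note that merge builds a fresh 0..n-1 index, which is why the last row above is labelled 5 rather than 7. If the original labels matter, the boolean-mask version keeps them.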
Timings:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000
L = ['AV','DF','SD','RF','F','WW','FG','SX']
dates = pd.date_range('2015-01-01', '2015-02-20')
# Parentheses let the chained expression span several lines
df = (pd.DataFrame({'TEXT': np.random.choice(L, N),
                    'ID#': np.random.randint(3000, size=N),
                    'DATE': np.random.choice(dates, N)})
        .sort_values(['ID#','DATE']).reset_index(drop=True))
#print (df)
In [375]: %timeit df.loc[df['TEXT'].str.contains('A'), ['ID#','DATE']].drop_duplicates().merge(df)
10 loops, best of 3: 96.1 ms per loop
In [376]: %timeit df[df['TEXT'].str.contains('A').groupby([df['ID#'],df['DATE']]).transform('any')]
1 loop, best of 3: 6.56 s per loop
# Wen's solution
In [377]: %timeit df.groupby(['ID#','DATE'],as_index=False).filter(lambda x : x.TEXT.str.contains('A').sum().any())
1 loop, best of 3: 30.1 s per loop
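Since the question mentions group.filter(), here is a hedged sketch of that variant on the question's sample data, together with a quick check that it selects the same rows as the transform-based mask. The names `filtered` and `masked` are illustrative; this is only meant to show the two are interchangeable on this data, not to say anything about speed.

```python
import pandas as pd

# Sample data copied from the question
notes = pd.DataFrame({
    "ID#": [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    "DATE": ["1/1/2017", "1/1/2017", "1/2/2017", "1/2/2017", "1/2/2017",
             "1/4/2017", "1/4/2017", "1/1/2017", "1/2/2017", "1/2/2017"],
    "TEXT": ["ENTERED BY A", "BLAH BLAH BLAH", "ENTERED BY B",
             "BLAH BLAH BLAH", "BLAH BLAH BLAH",
             "SUPPLEMENTAL PAYMENT BY A", "BLAH BLAH BLAH",
             "ENTERED BY C", "CHANGED COMPANY NAME", "BLAH BLAH BLAH"],
})

# filter() keeps a whole group when the function returns True for it.
# Note that the grouping keys must be passed as a list.
filtered = notes.groupby(["ID#", "DATE"]).filter(
    lambda g: g["TEXT"].str.contains("ENTERED BY", regex=False).any()
)

# The transform-based mask, for comparison
has_match = notes["TEXT"].str.contains("ENTERED BY", regex=False)
masked = notes[has_match.groupby([notes["ID#"], notes["DATE"]]).transform("any")]

# Both approaches should pick the same row labels
print(filtered.index.equals(masked.index))
```

The lambda runs once per group in Python, which is consistent with filter being the slowest of the three in the timings above.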