pandas: keep all rows in a group when one row contains a specified substring

Asked: 2017-10-31 15:00:09

Tags: python pandas pandas-groupby

My data looks like this:

    ID#     DATE        TEXT
    1       1/1/2017    ENTERED BY A
    1       1/1/2017    BLAH BLAH BLAH
    1       1/2/2017    ENTERED BY B
    1       1/2/2017    BLAH BLAH BLAH
    1       1/2/2017    BLAH BLAH BLAH
    2       1/4/2017    SUPPLEMENTAL PAYMENT BY A
    2       1/4/2017    BLAH BLAH BLAH
    3       1/1/2017    ENTERED BY C
    3       1/2/2017    CHANGED COMPANY NAME
    3       1/2/2017    BLAH BLAH BLAH

I am trying to group the data by ID# and DATE and return all rows of a group whenever any row in that group has a text match.

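Here is what I have so far. The code below tries to search the TEXT field of each row for the substring 'ENTERED BY' and return all rows associated with that group.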

    notes[notes.groupby('ID#','DATE',as_index=False).apply(lambda x: x['TEXT'].str.contains('ENTERED BY'))]

I have also tried variants of group.filter(), with similar results. Can anyone point me in the right direction? My output set should look like this:

    ID#     DATE        TEXT
    1       1/1/2017    ENTERED BY A
    1       1/1/2017    BLAH BLAH BLAH
    1       1/2/2017    ENTERED BY B
    1       1/2/2017    BLAH BLAH BLAH
    1       1/2/2017    BLAH BLAH BLAH
    3       1/1/2017    ENTERED BY C

Thanks!

1 answer:

Answer 0: (score: 3)

You can use groupby + transform with 'any', and then filter by boolean indexing:

df=df[df['TEXT'].str.contains('ENTERED BY').groupby([df['ID#'],df['DATE']]).transform('any')]
print (df)
   ID#      DATE            TEXT
0    1  1/1/2017    ENTERED BY A
1    1  1/1/2017  BLAH BLAH BLAH
2    1  1/2/2017    ENTERED BY B
3    1  1/2/2017  BLAH BLAH BLAH
4    1  1/2/2017  BLAH BLAH BLAH
7    3  1/1/2017    ENTERED BY C

Details:

print (df['TEXT'].str.contains('ENTERED BY'))
0     True
1    False
2     True
3    False
4    False
5    False
6    False
7     True
8    False
9    False
Name: TEXT, dtype: bool

print(df['TEXT'].str.contains('ENTERED BY').groupby([df['ID#'],df['DATE']]).transform('any'))
0     True
1     True
2     True
3     True
4     True
5    False
6    False
7     True
8    False
9    False
Name: TEXT, dtype: bool
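
For anyone who wants to run the approach above from scratch, here is a minimal, self-contained sketch. It rebuilds the sample data from the question and splits the one-liner into named steps. The variable names `has_match`, `group_has_match`, and `kept` are only illustrative, and `regex=False` is an assumption added here because 'ENTERED BY' is a plain substring rather than a pattern.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID#': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    'DATE': ['1/1/2017', '1/1/2017', '1/2/2017', '1/2/2017', '1/2/2017',
             '1/4/2017', '1/4/2017', '1/1/2017', '1/2/2017', '1/2/2017'],
    'TEXT': ['ENTERED BY A', 'BLAH BLAH BLAH', 'ENTERED BY B',
             'BLAH BLAH BLAH', 'BLAH BLAH BLAH',
             'SUPPLEMENTAL PAYMENT BY A', 'BLAH BLAH BLAH',
             'ENTERED BY C', 'CHANGED COMPANY NAME', 'BLAH BLAH BLAH'],
})

# Row-level flag: does this row's TEXT contain the substring?
has_match = df['TEXT'].str.contains('ENTERED BY', regex=False)

# Broadcast "at least one row in this (ID#, DATE) group matched"
# back onto every row of that group.
group_has_match = has_match.groupby([df['ID#'], df['DATE']]).transform('any')

# Boolean indexing keeps whole groups and preserves the original index.
kept = df[group_has_match]
print(kept)
```

Because the mask is aligned with `df`, the surviving rows keep their original index labels.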

Another, faster solution is to collect every matching ID# and DATE pair with drop_duplicates and then filter with merge:

df=df.loc[df['TEXT'].str.contains('ENTERED BY'), ['ID#','DATE']].drop_duplicates().merge(df)
print (df)
   ID#      DATE            TEXT
0    1  1/1/2017    ENTERED BY A
1    1  1/1/2017  BLAH BLAH BLAH
2    1  1/2/2017    ENTERED BY B
3    1  1/2/2017  BLAH BLAH BLAH
4    1  1/2/2017  BLAH BLAH BLAH
5    3  1/1/2017    ENTERED BY C

Details:

print (df.loc[df['TEXT'].str.contains('ENTERED BY'), ['ID#','DATE']].drop_duplicates())
   ID#      DATE
0    1  1/1/2017
2    1  1/2/2017
7    3  1/1/2017
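
The same idea can be written as a self-contained sketch on the question's sample data. The names `keys` and `merged` are illustrative, and passing `on=` and `how='inner'` explicitly is a choice made here for readability (the one-liner above relies on the defaults, which join on the shared columns).

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID#': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    'DATE': ['1/1/2017', '1/1/2017', '1/2/2017', '1/2/2017', '1/2/2017',
             '1/4/2017', '1/4/2017', '1/1/2017', '1/2/2017', '1/2/2017'],
    'TEXT': ['ENTERED BY A', 'BLAH BLAH BLAH', 'ENTERED BY B',
             'BLAH BLAH BLAH', 'BLAH BLAH BLAH',
             'SUPPLEMENTAL PAYMENT BY A', 'BLAH BLAH BLAH',
             'ENTERED BY C', 'CHANGED COMPANY NAME', 'BLAH BLAH BLAH'],
})

# Unique (ID#, DATE) pairs that have at least one matching row.
mask = df['TEXT'].str.contains('ENTERED BY', regex=False)
keys = df.loc[mask, ['ID#', 'DATE']].drop_duplicates()

# An inner join against the full frame keeps only rows whose pair is in `keys`.
merged = keys.merge(df, on=['ID#', 'DATE'], how='inner')
print(merged)
```

Note that merge builds a new default index, which is why the last row in the output above is labelled 5 rather than 7. If the original labels matter, the transform-based mask may be the more convenient choice.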

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000

L = ['AV','DF','SD','RF','F','WW','FG','SX']
dates = pd.date_range('2015-01-01', '2015-02-20')
# the outer parentheses let the method chain continue on the next line
df = (pd.DataFrame({'TEXT': np.random.choice(L, N),
                    'ID#': np.random.randint(3000, size=N),
                    'DATE': np.random.choice(dates, N)})
        .sort_values(['ID#','DATE']).reset_index(drop=True))
#print (df)
In [375]: %timeit df.loc[df['TEXT'].str.contains('A'), ['ID#','DATE']].drop_duplicates().merge(df)
10 loops, best of 3: 96.1 ms per loop

In [376]: %timeit df[df['TEXT'].str.contains('A').groupby([df['ID#'],df['DATE']]).transform('any')]
1 loop, best of 3: 6.56 s per loop

# Wen's solution (groupby + filter)
In [377]: %timeit df.groupby(['ID#','DATE'],as_index=False).filter(lambda x : x.TEXT.str.contains('A').sum().any())
1 loop, best of 3: 30.1 s per loop
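
For comparison, here is a hedged sketch of what a groupby + filter version could look like on the question's sample data. It is a simplified variant, not the exact expression timed above: it uses `.any()` directly and assumes the plain substring 'ENTERED BY'. The name `kept_by_filter` is illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'ID#': [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    'DATE': ['1/1/2017', '1/1/2017', '1/2/2017', '1/2/2017', '1/2/2017',
             '1/4/2017', '1/4/2017', '1/1/2017', '1/2/2017', '1/2/2017'],
    'TEXT': ['ENTERED BY A', 'BLAH BLAH BLAH', 'ENTERED BY B',
             'BLAH BLAH BLAH', 'BLAH BLAH BLAH',
             'SUPPLEMENTAL PAYMENT BY A', 'BLAH BLAH BLAH',
             'ENTERED BY C', 'CHANGED COMPANY NAME', 'BLAH BLAH BLAH'],
})

# filter() calls the lambda once per (ID#, DATE) group and keeps the
# whole group when the lambda returns True.
kept_by_filter = df.groupby(['ID#', 'DATE']).filter(
    lambda g: g['TEXT'].str.contains('ENTERED BY', regex=False).any()
)
print(kept_by_filter)
```

This reads closest to the wording of the question, but it runs a Python function for every group, which fits with it being the slowest of the three options in the timings above.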