pandas str.extract / re.search using another Series or column as the pattern

Time: 2018-07-23 13:39:05

Tags: python pandas

import re
import pandas as pd
df = pd.DataFrame({'a':{0:'aa',1:'dd',2:'cc'},
                   'b':{0:'aa(bb)daa',1:'eedd(ed)',2:'affaa(f)'}})

    a   b
0   aa  aa(bb)daa
1   dd  eedd(ed)
2   cc  affaa(f)

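I want to extract the characters inside the parentheses, but only when the pattern immediately before the parentheses is the corresponding value in df['a'].

I tried using: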

def searcher(x):
    # x is one row: build the pattern from column 'a' and search column 'b'
    pat_result = re.search(x['a'] + r'\((.*?)\)', x['b'])
    if pat_result:
        return pat_result.group(1)

df[['a','b']].apply(lambda x :searcher(x), axis=1)

0      bb
1      ed
2    None
dtype: object

%%timeit
df[['a','b']].apply(lambda x :searcher(x), axis=1)
1.33 ms ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


I'm just wondering whether there is a faster way (still within pandas), or a way to use str.extract directly?


Is there a way to make this work?

df['b'].str.extract(df['a'] + r'\((.*?)\)', expand=False)

1 answer:

Answer 0 (score: 0)

Here's a solution that uses an explicit loop over the rows. Across several runs its timing varied, sometimes faster and sometimes slower than the original solution.

%%timeit
for i, j in df.iterrows():
    pat_search = re.search(j['a'] + r'\((.*?)\)', j['b'])
    if pat_search:
        # j is a separate Series for this row, so adding 'c' here does not
        # add a column to df; use df.loc[i, 'c'] = ... to store the result
        j['c'] = pat_search.group(1)
# First run
264 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Second run
1.62 ms ± 78.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For comparison, here is the timing of your original solution:

%%timeit
df[['a','b']].apply(lambda x :searcher(x), axis=1)

1.34 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
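
Another option worth trying is a plain list comprehension over the two columns, which avoids building a Series object for every row. The sketch below is only an illustration: the helper name extract_after is made up for this example, and it assumes the values in df['a'] are literal text rather than regular expressions, which is why it wraps them in re.escape. No timing is claimed for it; whether it is faster on your data would need to be checked with %%timeit.

```python
import re
import pandas as pd

df = pd.DataFrame({'a': ['aa', 'dd', 'cc'],
                   'b': ['aa(bb)daa', 'eedd(ed)', 'affaa(f)']})

def extract_after(prefix, text):
    # Hypothetical helper: return the text inside the first "(...)" that
    # directly follows `prefix`, or None when there is no such match.
    # Assumption: `prefix` is literal text, so it is escaped before use.
    match = re.search(re.escape(prefix) + r'\((.*?)\)', text)
    return match.group(1) if match else None

# Pair the two columns row by row instead of calling apply(axis=1)
result = [extract_after(a, b) for a, b in zip(df['a'], df['b'])]

# Store the results in a new column so they persist in df
df['c'] = result
```

If the values in df['a'] are meant to be regular expressions themselves, drop the re.escape call.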