Question

Consider this simple example:

import pandas as pd

mydata = pd.DataFrame({'mystring' : ['heLLohelloy1', 'hAllohallo'],
                       'myregex' : ['hello.[0-9]', 'ulla']})

mydata
Out[3]: 
       myregex      mystring
0  hello.[0-9]  heLLohelloy1
1         ulla    hAllohallo

I want to create a variable flag that identifies the rows where mystring matches the regular expression stored in myregex on the same row.

That is, in this example only the first row matches: heLLohelloy1 matches the regex hello.[0-9], whereas hAllohallo does not match the regex ulla.

How can I do this as efficiently as possible in pandas? We are talking about millions of observations (the data still fits in RAM).

Answer 1

You can use the re library together with DataFrame.apply, like this:

import re

# apply function
mydata['flag'] = mydata.apply(lambda row: bool(re.search(row['myregex'], row['mystring'])), axis=1)

### to convert bool to int - optional
### mydata['flag'] = mydata['flag'].astype(int)

       myregex      mystring   flag
0  hello.[0-9]  heLLohelloy1   True
1         ulla    hAllohallo  False
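
Since the question is about millions of rows, it may be worth avoiding the per-row overhead of apply with axis=1. The following is only a sketch under that assumption, not a benchmarked result: it walks the two columns with zip and caches compiled patterns so that each distinct regex is compiled once. The helper name compiled is made up for illustration.

```python
import re
from functools import lru_cache

import pandas as pd

mydata = pd.DataFrame({'mystring': ['heLLohelloy1', 'hAllohallo'],
                       'myregex': ['hello.[0-9]', 'ulla']})


@lru_cache(maxsize=None)
def compiled(pattern):
    # Hypothetical helper: compile each distinct pattern only once.
    return re.compile(pattern)


# Pair each regex with the string on the same row and search it.
mydata['flag'] = [bool(compiled(p).search(s))
                  for p, s in zip(mydata['myregex'], mydata['mystring'])]

print(mydata['flag'].tolist())  # [True, False]
```

Whether this is actually faster on your data depends on how many distinct patterns there are, so it is worth timing both versions.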

Answer 2

I came up with this solution; could you check whether it does what you need?

[pd.Series(y).str.contains(x)[0] for x,y in zip(mydata.myregex,mydata.mystring)]

Out[54]: [True, False]

Or we can use map:

list(map(lambda x: pd.Series(x[1]).str.contains(x[0])[0], zip(mydata.myregex,mydata.mystring)))
Out[56]: [True, False]
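
Both versions above build a one-element Series for every row, which is likely to be slow at scale. If the number of distinct regexes is small compared with the number of rows (an assumption about the data, not something stated in the question), one possible alternative is to group the rows by pattern and call the vectorized str.contains once per pattern. A minimal sketch:

```python
import pandas as pd

mydata = pd.DataFrame({'mystring': ['heLLohelloy1', 'hAllohallo'],
                       'myregex': ['hello.[0-9]', 'ulla']})

# Start with False everywhere, then fill in one group of rows per pattern.
flag = pd.Series(False, index=mydata.index)
for pattern, strings in mydata.groupby('myregex')['mystring']:
    # str.contains performs a regex search over every string in the group.
    flag.loc[strings.index] = strings.str.contains(pattern, regex=True)

mydata['flag'] = flag
print(mydata['flag'].tolist())  # [True, False]
```

With nearly as many patterns as rows, this degenerates into a Python loop over tiny groups and is unlikely to help.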
