Question

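I have a data frame with a long text field and a short string that is essentially a category. My goal is to use a regular expression to create a new column in the data frame indicating whether there is a match. The regular expression depends on the category. Here is an example: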

import numpy as np
import pandas as pd

a = ['the dog is mad and sad 50', 'the cat is happy']
b = ['dog', 'cat']
regex = ['[0-9]{2}', '[0-9]{3}']

ab = pd.DataFrame(zip(a, b, regex), columns=['text', 'category', 'pattern'])

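In the example above, to avoid a for loop over every category, I stored the pattern as a string column in the data frame and would like to use that pattern column as the regular expression.

However, when I run the following, I get an error: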

ab['match'] = np.where(ab[ab['text'].str.contains(ab['pattern'], regex = True)], 1, 0)

TypeError: 'Series' objects are mutable, thus they cannot be hashed

The data frame is very large and may have many categories, so a vectorized solution like the one above is preferred.

Answer 1

If you want to apply a specific regular expression to a specific row, you cannot use a vectorized approach. You have to use a row-wise apply:

import re

ab['match'] = ab.apply(lambda row: bool(re.search(row['pattern'], row['text'])), axis=1)

                        text category   pattern  match
0  the dog is mad and sad 50      dog  [0-9]{2}   True
1           the cat is happy      cat  [0-9]{3}  False
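
If the number of distinct patterns is much smaller than the number of rows (an assumption, the question only says there may be many categories), one possible alternative is to loop over the unique patterns instead of over the rows, and call Series.str.contains once per pattern on the rows that share it. This is not part of the original answer, only a minimal sketch of that idea using the question's example data:

```python
import pandas as pd

# Example data from the question.
ab = pd.DataFrame({
    'text': ['the dog is mad and sad 50', 'the cat is happy'],
    'category': ['dog', 'cat'],
    'pattern': ['[0-9]{2}', '[0-9]{3}'],
})

# Start with "no match" everywhere, then fill in one pattern group at a time.
match = pd.Series(False, index=ab.index)
for pattern, group in ab.groupby('pattern'):
    # str.contains takes a single pattern string, so it is called once
    # per distinct pattern on the rows that use that pattern.
    match.loc[group.index] = group['text'].str.contains(pattern, regex=True)

ab['match'] = match
print(ab)
```

The Python-level loop here runs once per distinct pattern rather than once per row. When every row has its own pattern, this offers no advantage over the row-wise apply.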
