Question

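I have a DataFrame 'df' and a list of strings 'lst'. I want to iterate over the list and find the DataFrame rows that match each string in the list. The code below works fine when the list elements contain no parentheses. It seems the regular expression is not defined correctly, and somehow the double parentheses do not match.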

import pandas as pd
import re

d = {'col1': ['100-(abc)','qwe-100-(abc)', '100-(abc)1', 
              'xyz', 'xyz2', 'zzz'], 
     'col2': ['100', '1001','200', '300', '400', '500']}

df = pd.DataFrame(d)

lst = ['100-(abc)', 'xyz']

for l in lst:
    print("======================")
    pattern = re.compile(r"(" + l + ")$")
    print(df[df.col1.str.contains(pattern, regex=True)])

Result:

======================
Empty DataFrame
Columns: [col1, col2]
Index: []
======================
  col1 col2
3  xyz  300

Expected result:

======================
  col1           col2
0  100-(abc)     100
1  qwe-100-(abc) 1001

======================
  col1 col2
3  xyz  300

Answer 1

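You need to understand:

Regular expressions reserve certain characters for special use. The opening parenthesis ( and the closing parenthesis ) are among them: together they define a group rather than matching themselves.

If you want to use any of these characters as a literal in a regular expression, you need to escape it with a backslash. If you want to match 1+1=2, the correct regular expression is 1\+1=2. Otherwise, the plus sign has a special meaning. The same goes for parentheses: if you want to match (abc), you must write \(abc\).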
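The difference is easy to check with the standard re module alone. The sketch below is only an illustration: the sample strings come from the question, and re.escape is a standard-library helper (not used in this answer's own code) that adds the backslashes automatically.

```python
import re

target = "100-(abc)"

# Unescaped: ( and ) form a group, so this pattern really means "100-abc" at the end.
unescaped = re.compile("(" + target + ")$")

# Escaped: re.escape backslash-escapes the metacharacters so they match literally.
escaped = re.compile("(?:" + re.escape(target) + ")$")

samples = ["100-(abc)", "qwe-100-(abc)", "100-(abc)1", "100-abc"]

for s in samples:
    # Print whether each version of the pattern finds a match in the sample.
    print(s, bool(unescaped.search(s)), bool(escaped.search(s)))
```

The unescaped pattern only finds "100-abc", a string with no parentheses at all, which is why the question's loop returned an empty DataFrame.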

import pandas as pd
import re

d = {'col1': ['100-(abc)','qwe-100-(abc)', '100-(abc)1',
              'xyz', 'xyz2', 'zzz'],
     'col2': ['100', '1001','200', '300', '400', '500']}

df = pd.DataFrame(d)

lst = ['100-(abc)', 'xyz']


for l in lst:
    print("======================")
    # Escape the parentheses so they match literally. replace() leaves strings
    # without parentheses unchanged, so no if/else branch is needed.
    match = l.replace('(', r'\(').replace(')', r'\)')
    # A non-capturing group (?:...) avoids the pandas warning about match groups.
    pattern = r"(?:" + match + r")$"
    print(df[df.col1.str.contains(pattern, regex=True)])

Output:

======================
            col1  col2
0      100-(abc)   100
1  qwe-100-(abc)  1001
======================
  col1 col2
3  xyz  300
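Escaping only the parentheses covers this data, but list entries containing other metacharacters (such as + or .) would still misbehave. As a possible generalization, not part of the answer above, re.escape from the standard library escapes every special character. The helper name rows_ending_with is hypothetical and used here only for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "col1": ["100-(abc)", "qwe-100-(abc)", "100-(abc)1", "xyz", "xyz2", "zzz"],
    "col2": ["100", "1001", "200", "300", "400", "500"],
})
lst = ["100-(abc)", "xyz"]


def rows_ending_with(frame, text):
    """Hypothetical helper: rows whose col1 ends with the literal string `text`."""
    # re.escape makes every regex metacharacter in `text` literal.
    pattern = re.escape(text) + "$"
    return frame[frame["col1"].str.contains(pattern, regex=True)]


for item in lst:
    print("======================")
    print(rows_ending_with(df, item))
```

With the question's data this should select rows 0 and 1 for '100-(abc)' and row 3 for 'xyz'.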

Answer 2

Just use isin, which keeps the rows whose col1 value is exactly equal to one of the list entries:

df[df.col1.isin(lst)]


    col1        col2
0   100-(abc)   100
3   xyz         300

Edit: adding a regex pattern together with isin

df[(df.col1.isin(lst)) | (df.col1.str.contains(r'\d+-\(.*\)$', regex=True))]

You get

    col1            col2
0   100-(abc)       100
1   qwe-100-(abc)   1001
3   xyz             300
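The regex in the edit hard-codes the "digits, dash, parentheses" shape of the data. If the goal is simply "col1 ends with any entry of lst", one alternative is to build a single pattern from the list itself. This is a sketch under that assumption rather than the answerer's method; the names suffix_pattern and result are made up for the example.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "col1": ["100-(abc)", "qwe-100-(abc)", "100-(abc)1", "xyz", "xyz2", "zzz"],
    "col2": ["100", "1001", "200", "300", "400", "500"],
})
lst = ["100-(abc)", "xyz"]

# Escape each entry, join them with | (alternation), and anchor at the end of the string.
suffix_pattern = "(?:" + "|".join(re.escape(s) for s in lst) + ")$"

# One vectorized pass over col1 instead of one pass per list entry.
result = df[df["col1"].str.contains(suffix_pattern, regex=True)]
print(result)
```

On the question's data this should keep rows 0, 1 and 3, the same rows as the combined isin and regex filter.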

Python - Matching a regular expression built from a variable, including double parentheses
