Question

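Given a DataFrame with 3 columns, how can I find the rows whose col1 matches an entry of a separate Series? I need to be able to use regular expressions, because my Series contains * as a placeholder for anything in that field.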

I have a pandas Series consisting of the following data:

col1
joe\creed\found\match
matt\creed\*\not
adam\creed\notfound\match

I have another DataFrame with the following data:

col1                       col2 col3
joe2\creed2\found\match    2    23
matt2\creed2\found2\not    2    23
adam\creed\notfound\match  2    23
matt\creed\found\not       2    23

I tried the following code without success.

for item in series:
    print(df[df.col1.str.contains(item, regex=True)])

and

for item in series:
    print(df[df.col1.isin([str(item)])])

My expected output is as follows:

col1                       col2 col3
adam\creed\notfound\match  2    23
matt\creed\found\not       2    23

Answer 1

You can do it this way:

Data:

In [163]: s
Out[163]:
0        joe\creed\found\match
1             matt\creed\*\not
2    adam\creed\notfound\match
Name: col1, dtype: object

In [164]: df
Out[164]:
                        col1  col2  col3
0    joe2\creed2\found\match     2    23
1    matt2\creed2\found2\not     2    23
2  adam\creed\notfound\match     2    23
3       matt\creed\found\not     2    23

Solution:

import re

# replacing '*' --> '[^\\]*' (in the escaped string: '\\\*' --> '[^\\\\]*')
# regex=True is passed explicitly: since pandas 2.0, str.replace treats the pattern as a literal string by default
pat = s.apply(re.escape).str.replace(r'\\\*', r'[^\\\\]*', regex=True).str.cat(sep='|')

# use the following line instead, if `s` is a DataFrame (not a Series):
#pat = s.col1.apply(re.escape).str.replace(r'\\\*', r'[^\\\\]*', regex=True).str.cat(sep='|')

In [161]: df[df.col1.str.contains(pat)]
Out[161]:
                        col1  col2  col3
2  adam\creed\notfound\match     2    23
3       matt\creed\found\not     2    23

In [162]: pat
Out[162]: 'joe\\\\creed\\\\found\\\\match|matt\\\\creed\\\\[^\\\\]*\\\\not|adam\\\\creed\\\\notfound\\\\match'

The main difficulty is correctly escaping all the special characters (such as \) in the "search pattern" Series.
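
For reference, the same idea can be written as a self-contained script. This is only a sketch under a few assumptions: the sample data is copied from the question, `wildcard_to_regex` is a hypothetical helper name (not a pandas or `re` function), and `*` is taken to mean "any run of characters that does not contain a backslash", as in the pattern above. Instead of escaping the whole string and then patching the escaped `*`, it splits on `*` and escapes each literal piece separately, which sidesteps the double-escaping.

```python
import re

import pandas as pd

# Sample data copied from the question.
s = pd.Series(
    [r"joe\creed\found\match", r"matt\creed\*\not", r"adam\creed\notfound\match"],
    name="col1",
)
df = pd.DataFrame(
    {
        "col1": [
            r"joe2\creed2\found\match",
            r"matt2\creed2\found2\not",
            r"adam\creed\notfound\match",
            r"matt\creed\found\not",
        ],
        "col2": [2, 2, 2, 2],
        "col3": [23, 23, 23, 23],
    }
)


def wildcard_to_regex(value: str) -> str:
    """Hypothetical helper: escape `value`, letting '*' match one path field."""
    # Split on the wildcard, escape each literal piece, and join the pieces
    # with a character class that matches anything except a backslash.
    return r"[^\\]*".join(re.escape(part) for part in value.split("*"))


# One alternation covering every entry of the Series.
pattern = "|".join(wildcard_to_regex(v) for v in s)

# Keep the rows whose col1 contains a match for any of the patterns.
matches = df[df["col1"].str.contains(pattern, regex=True)]
print(matches)
```

Note that `str.contains` searches anywhere in the string. If whole-value matches are required, `Series.str.fullmatch` (available in pandas 1.1 and later) may be a better fit.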
