Question

This is a follow-up question to this topic. I want to search strings for both whole and partial matches, using keywords like those in the following Series "w":

rigour*
*demeanour*
centre*
*arbour
fulfil

This means I want to match rigour and rigour(s), (en)demeanour(s), centre(s), (h)arbour, and fulfil. So my keyword list is a mix of whole and partial strings. I want to apply the search to this DataFrame "df":

ID;name
01;rigour
02;rigours
03;endemeanour
04;endemeanours
05;centre
06;centres
07;encentre
08;fulfil
09;fulfill
10;harbour
11;arbour
12;harbours

So far, I have tried the following:

r = re.compile(r'.*({}).*'.format('|'.join(w.values)), re.IGNORECASE)

Then I built a mask to filter the DataFrame:

mask = [m.group(1) if m else None for m in map(r.search, df['name'])]

To get a new column holding the keyword that was found:

df['keyword'] = mask

What I expect is the following resulting DataFrame:

ID;name;keyword
01;rigour;rigour
02;rigours;rigour
03;endemeanour;demeanour
04;endemeanours;demeanour
05;centre;centre
06;centres;centre
07;encentre;None
08;fulfil;fulfil
09;fulfill;None
10;harbour;arbour
11;arbour;arbour
12;harbours;None

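This works when the list w contains no *. My problem now is how to format the keywords in w that carry the * conditions so that re.compile handles them correctly.

Any help would be greatly appreciated.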

Answer 1

It looks like your input Series w needs to be adjusted so that its entries work as regex patterns:

rigour.*
.*demeanour.*
centre.*
\b.*arbour\b
\bfulfil\b

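Note that * in a regex does nothing on its own. It is a quantifier: whatever precedes it may be repeated 0 or more times.

Also note that fulfil is a part of fulfill, so if you want a strict match you need to tell the regex. For example, with a word boundary, \b, on both sides it will match only the whole word.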
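To make these two points concrete, here is a small sketch using only the standard re module. The variable names are made up for illustration:

```python
import re

# A bare leading '*' has nothing to repeat, so the raw keyword is not a valid regex.
try:
    re.compile('*demeanour*')
    raw_keyword_compiles = True
except re.error:
    raw_keyword_compiles = False

# 'centre*' means 'centr' plus zero or more 'e'; 'centre.*' means 'centre' plus anything.
star_only = re.fullmatch(r'centre*', 'centres')    # no match
dot_star = re.fullmatch(r'centre.*', 'centres')    # match

# Without \b, 'fulfil' is found inside 'fulfill'; with \b on both sides it is not.
loose = re.search(r'fulfil', 'fulfill')            # match
strict = re.search(r'\bfulfil\b', 'fulfill')       # no match

print(raw_keyword_compiles, bool(star_only), bool(dot_star), bool(loose), bool(strict))
```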

Here is how the regex gives you the desired result:

s = '({})'.format('|'.join(w.values))
r = re.compile(s, re.IGNORECASE)
r

re.compile(r'(rigour.*|.*demeanour.*|centre.*|\b.*arbour\b|\bfulfil\b)', re.IGNORECASE)
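If you would rather not edit w by hand, the conversion could probably be automated. Below is a rough sketch, not the only way to do it: wildcard_to_regex is a hypothetical helper (not part of pandas or re), and it assumes * only ever appears at the start or end of a keyword. It also swaps the \b anchors for whole-string matching with fullmatch, which gives the same strictness for single-word names.

```python
import re

def wildcard_to_regex(keyword):
    """Hypothetical helper: turn a keyword like '*arbour' into regex source."""
    # A leading or trailing '*' becomes '.*'; the stem itself is escaped.
    prefix = '.*' if keyword.startswith('*') else ''
    suffix = '.*' if keyword.endswith('*') else ''
    stem = re.escape(keyword.strip('*'))
    return prefix + stem + suffix

keywords = ['rigour*', '*demeanour*', 'centre*', '*arbour', 'fulfil']
patterns = [wildcard_to_regex(k) for k in keywords]
combined = re.compile('|'.join(patterns), re.IGNORECASE)

# fullmatch requires the whole name to fit one alternative,
# so 'fulfill' and 'harbours' are rejected without any \b.
names = ['rigours', 'encentre', 'fulfill', 'harbour', 'harbours']
matched = [n for n in names if combined.fullmatch(n)]
print(patterns)
print(matched)
```

On a pandas Series, Series.str.fullmatch should play the same role as fullmatch does here.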

You can finish the replacement with the pandas .where method:

df['keyword'] = df.name.where(df.name.str.match(r), None)
df

            ID          name       keyword
        0    1        rigour        rigour
        1    2       rigours       rigours
        2    3   endemeanour   endemeanour
        3    4  endemeanours  endemeanours
        4    5        centre        centre
        5    6       centres       centres
        6    7      encentre          None
        7    8        fulfil        fulfil
        8    9       fulfill          None
        9   10       harbour       harbour
        10  11        arbour        arbour
        11  12      harbours          None
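Note that the keyword column above holds the full name, whereas the question asked for the keyword itself (rigour for rigours, and so on). One possible way to get that, sketched here with plain re and under the assumption that each name is a single word, is to put every stem in its own capture group and read back whichever group matched. The names pattern, find_keyword, and found are made up for this example.

```python
import re

# Each stem sits in its own capture group; the wildcards stay outside the group.
pattern = re.compile(
    r'(rigour).*|.*(demeanour).*|(centre).*|.*(arbour)|(fulfil)',
    re.IGNORECASE,
)

def find_keyword(name):
    """Return the stem that matched the whole name, or None."""
    m = pattern.fullmatch(name)
    # Only one alternative can match, so lastindex points at its group.
    return m.group(m.lastindex) if m else None

names = ['rigour', 'rigours', 'endemeanour', 'endemeanours', 'centre', 'centres',
         'encentre', 'fulfil', 'fulfill', 'harbour', 'arbour', 'harbours']
found = [find_keyword(n) for n in names]
print(found)
```

With the DataFrame from the question, this could be applied as df['keyword'] = df['name'].map(find_keyword).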

pandas and "re" - searching for whole and partial strings
