Question

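I have a large DataFrame in which one column contains several variants of the same word. I want to filter rows based on one specific word. A sample DataFrame is shown below. Here, I want to keep the rows whose "Resolution" column contains the word "create", but not the rows where it only appears as a substring of another word, such as "recreate" or "re-create".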

Note: I only want a regex solution that can be used with str.contains.

In [4]: df = pd.DataFrame({"Resolution": ["create profile", "recreate profile", "re-create profile", "created profile",
   ...:                                   "re-created profile", "closed outlook and recreated profile",
   ...:                                   "purged outlook processes and created new profile"],
   ...:                    "Product": ["Outlook", "Outlook", "Outlook", "Outlook", "Outlook", "Outlook", "Outlook"]})

In [5]: df
Out[5]:
                                         Resolution  Product
0                                    create profile  Outlook
1                                  recreate profile  Outlook
2                                 re-create profile  Outlook
3                                   created profile  Outlook
4                                re-created profile  Outlook
5              closed outlook and recreated profile  Outlook
6  purged outlook processes and created new profile  Outlook

My attempt

I have been able to filter for "recreate" and "re-create" (the tense does not matter):

In [13]: df[df.Resolution.str.contains("(?=.*recreate|re-create)(?=.*profile)")]
Out[13]:
                             Resolution  Product
1                      recreate profile  Outlook
2                     re-create profile  Outlook
4                    re-created profile  Outlook
5  closed outlook and recreated profile  Outlook

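Question: how can I modify the regex so that it returns only the rows that contain "create" as a word of its own, not as a substring of another word? Like this:

                                         Resolution  Product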
0                                    create profile  Outlook
3                                   created profile  Outlook
6  purged outlook processes and created new profile  Outlook

Answer 1

Add ~ to invert the condition:

df = df[~df.Resolution.str.contains("(?=.*recreate|re-create)(?=.*profile)")]
print(df)
                                         Resolution  Product
0                                    create profile  Outlook
3                                   created profile  Outlook
6  purged outlook processes and created new profile  Outlook
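
Inverting the mask works for this sample because every row mentions some form of "create"; on other data it would also keep rows that do not contain "create" at all. As an alternative, here is a minimal sketch of a pattern that matches "create" positively, assuming the only variants to exclude are "recreate" and "re-create" (the names pattern, mask and filtered are illustrative and not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    "Resolution": [
        "create profile",
        "recreate profile",
        "re-create profile",
        "created profile",
        "re-created profile",
        "closed outlook and recreated profile",
        "purged outlook processes and created new profile",
    ],
    "Product": ["Outlook"] * 7,
})

# \bcreate requires a word boundary right before "create", which rules out
# "recreate" (no boundary between "re" and "create").
# (?<!re-) is a fixed-width negative lookbehind that rejects "re-create",
# where the hyphen would otherwise provide a word boundary.
pattern = r"(?<!re-)\bcreate"

mask = df["Resolution"].str.contains(pattern, regex=True)
filtered = df[mask]
print(filtered)
```

Because nothing follows "create" in the pattern, the past tense "created" still matches. If rows must also mention "profile", the lookahead from the question could presumably be kept alongside this pattern.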
