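I have a large DataFrame in which one column contains several variants of the same word. I want to filter rows based on one specific word. A sample DataFrame is shown below. Here, I want to keep the rows whose "Resolution" column contains the word "create", but not the rows where "create" only appears as part of a longer word such as "recreate" or "re-create".
Note: I only want to do this with str.contains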
In [4]: df = pd.DataFrame({"Resolution":["create profile", "recreate profile", "re-create profile", "created profile",
...: "re-created profile", "closed outlook and recreated profile",
...: "purged outlook processes and created new profile"],
...: "Product":["Outlook", "Outlook", "Outlook", "Outlook", "Outlook", "Outlook", "Outlook"]})
In [5]: df
Out[5]:
                                         Resolution  Product
0                                    create profile  Outlook
1                                  recreate profile  Outlook
2                                 re-create profile  Outlook
3                                   created profile  Outlook
4                                re-created profile  Outlook
5              closed outlook and recreated profile  Outlook
6  purged outlook processes and created new profile  Outlook
My attempt
I have been able to filter for "recreate" and "re-create" (the past tense does not matter):
In [13]: df[df.Resolution.str.contains("(?=.*recreate|re-create)(?=.*profile)")]
Out[13]:
                             Resolution  Product
1                      recreate profile  Outlook
2                     re-create profile  Outlook
4                    re-created profile  Outlook
5  closed outlook and recreated profile  Outlook
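Question: How can I modify the regex so that it returns only the rows containing "create" as its own word, and not as part of a longer word? Like this: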
                                         Resolution  Product
0                                    create profile  Outlook
3                                   created profile  Outlook
6  purged outlook processes and created new profile  Outlook
Answer 0: (score: 1)
Add ~ to invert the condition:
df = df[~df.Resolution.str.contains("(?=.*recreate|re-create)(?=.*profile)")]
print(df)
                                         Resolution  Product
0                                    create profile  Outlook
3                                   created profile  Outlook
6  purged outlook processes and created new profile  Outlook
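
Note that inverting with ~ keeps every row that does not match the "recreate"/"re-create" pattern, so it would also keep rows that never mention "create" at all. If a positive match is preferred, one possible alternative (a sketch that is not part of the answer above) is to combine a word boundary with a negative lookbehind inside str.contains. The pattern and the names pattern, mask and result are my own illustrative choices, and the sketch assumes "re-" is the only hyphenated prefix that needs to be excluded.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Resolution": [
        "create profile",
        "recreate profile",
        "re-create profile",
        "created profile",
        "re-created profile",
        "closed outlook and recreated profile",
        "purged outlook processes and created new profile",
    ],
    "Product": ["Outlook"] * 7,
})

# \bcreate requires "create" to start a word, which rules out "recreate".
# (?<!re-) additionally rejects a "create" that directly follows "re-".
# Assumption: "re-" is the only hyphenated prefix to exclude.
pattern = r"(?<!re-)\bcreate"

# Keep rows that match the pattern and also mention "profile".
mask = df["Resolution"].str.contains(pattern) & df["Resolution"].str.contains("profile")
result = df[mask]
print(result)
```

With the sample data this selects rows 0, 3 and 6, the same rows as the inverted filter, but it would not select a row that lacks "create" entirely.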