Question

df (a Pandas DataFrame) has three rows.

col_name
"This is Donald."
"His hands are so small"
"Why are his fingers so short?"

I want to extract the rows that contain "is" and "small".

If I do

df.col_name.str.contains("is|small", case=False)

then it also catches "His", which I don't want.

Is the query below the correct way to match whole words in df.series?

df.col_name.str.contains("\bis\b|\bsmall\b", case=False)

Answer 1

No, the regex /bis/b|/bsmall/b will fail because you are using /b rather than \b, which means "word boundary".

Change that and you get a match. I would recommend using

\b

That regex is a bit faster and a bit clearer, at least to me.
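One way to write both alternatives with a single pair of word boundaries is sketched below. The grouped pattern and the sample frame (rebuilt from the question) are my assumptions for illustration. The group is non-capturing so that pandas does not warn about match groups.

```python
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({"col_name": [
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
]})

# Raw string, so \b reaches the regex engine as a word boundary.
# (?:...) is a non-capturing group shared by both alternatives.
pattern = r"\b(?:is|small)\b"
mask = df["col_name"].str.contains(pattern, case=False)

print(df[mask])  # rows 0 and 1; "His" and "his" no longer match "is"
```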

Answer 2

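Your way (with /b) didn't work for me. I'm not sure why you can't use the logical operator and (&), since I think that's what you actually want.

This is a silly way to do it, but it works: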

mask = lambda x: ("is" in x) & ("small" in x)
series_name.apply(mask)
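Note that a plain substring test such as "is" in x is also true for "His". A rough sketch of the same AND logic restricted to whole words follows. The helper name has_word and the sample frame are hypothetical, not part of the answer.

```python
import re
import pandas as pd

df = pd.DataFrame({"col_name": [
    "This is Donald.",
    "His hands are so small",
    "Why are his fingers so short?",
]})

def has_word(series, word):
    """Hypothetical helper: True where `word` occurs as a whole word."""
    return series.str.contains(rf"\b{re.escape(word)}\b", case=False)

# Combine the per-word masks with & to require every word.
is_and_small = has_word(df["col_name"], "is") & has_word(df["col_name"], "small")
so_and_small = has_word(df["col_name"], "so") & has_word(df["col_name"], "small")

print(is_and_small.tolist())  # [False, False, False]
print(so_and_small.tolist())  # [False, True, False]
```

No row contains both "is" and "small" as whole words, so the first mask is all False. Swap & for | if either word should be enough.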

Answer 3

First, you may want to convert everything to lowercase, strip punctuation and whitespace, and then convert the result into a set of words.

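import string

# regex=True is needed on pandas >= 2.0, where str.replace treats the
# pattern as a literal string by default.
df['words'] = [set(words) for words in
    df['col_name']
    .str.lower()
    .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
    .str.strip()
    .str.split()
]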

>>> df
                        col_name                                words
0                This is Donald.                   {this, is, donald}
1         His hands are so small         {small, his, so, are, hands}
2  Why are his fingers so short?  {short, fingers, his, so, are, why}

You can now use boolean indexing to check whether all of the target words are in these new word sets.

target_words = ['is', 'small']
# Convert target words to lower case just to be safe.
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))


print(df)
# Output: 
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}  False
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False    

target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]

df['match'] = df.words.apply(lambda words: all(target_word in words 
                                               for target_word in target_words))

print(df)
# Output:
#                         col_name                                words  match
# 0                This is Donald.                   {this, is, donald}  False
# 1         His hands are so small         {small, his, so, are, hands}   True
# 2  Why are his fingers so short?  {short, fingers, his, so, are, why}  False

Extract the matching rows:

>>> df.loc[df.match, 'col_name']
# Output:
# 1    His hands are so small
# Name: col_name, dtype: object

To do all of this in a single statement using boolean indexing:

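# regex=True is needed on pandas >= 2.0, where str.replace treats the
# pattern as a literal string by default.
df.loc[[all(target_word in word_set for target_word in target_words)
        for word_set in (set(words) for words in
                         df['col_name']
                         .str.lower()
                         .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
                         .str.strip()
                         .str.split())], :]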

Answer 4

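In "\bis\b|\bsmall\b", the escape sequence \b is parsed as the ASCII backspace character before the string is even passed to the regex method for matching/searching. For more information, check this document about escape characters. That document says:

When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.

Either use the r prefix: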

df.col_name.str.contains(r"\bis\b|\bsmall\b", case=False)

or escape the \ character:

df.col_name.str.contains("\\bis\\b|\\bsmall\\b", case=False)
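A quick way to see the difference, using only the standard re module. The variable names are just for illustration.

```python
import re

plain = "\bis\b"        # here \b is the backspace character (U+0008)
raw = r"\bis\b"         # here \b stays as backslash + b, a regex word boundary
escaped = "\\bis\\b"    # same characters as the raw string

print(len(plain), len(raw))   # 4 6
print(raw == escaped)         # True

text = "This is Donald."
print(re.search(plain, text))          # None: the text has no backspace characters
print(re.search(raw, text).span())     # (5, 7): the standalone "is"
```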

If you want to see an example, here is the Fiddle.

Answer 5

As an extension of this discussion, I would like to use a variable in the regex, as follows:

df = df_w[df_w['Country/Region'].str.match("\b(location.loc[i]['country'])\b",case=False)]

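If I leave out the \b \b, the code returns all the columns for both Sudan and South Sudan. But when I use "\b(location.loc[i]['country'])\b", it returns an empty DataFrame. Please tell me the correct usage.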
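Two things seem to be going on here. Inside the quotes, location.loc[i]['country'] is literal text rather than an evaluated expression, and without an r prefix \b is a backspace. A minimal sketch of building the pattern from a variable follows. The sample data and the variable country (standing in for location.loc[i]['country']) are assumptions on my part.

```python
import re
import pandas as pd

# Made-up sample data; only the column name comes from the post.
df_w = pd.DataFrame({"Country/Region": ["Sudan", "South Sudan", "Chad"]})
country = "Sudan"  # stands in for location.loc[i]['country']

# Raw f-string: \b stays a word boundary, {…} is interpolated.
# re.escape guards against regex metacharacters in the name.
pattern = rf"\b{re.escape(country)}\b"
whole_word = df_w[df_w["Country/Region"].str.contains(pattern, case=False)]

# "Sudan" is still a whole word inside "South Sudan", so to keep only
# the exact name, match the entire cell instead.
exact = df_w[df_w["Country/Region"].str.fullmatch(re.escape(country), case=False)]

print(whole_word["Country/Region"].tolist())  # ['Sudan', 'South Sudan']
print(exact["Country/Region"].tolist())       # ['Sudan']
```

If the goal is simply exact equality ignoring case, comparing the lowercased column with country.lower() would also work without any regex.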

python pandas.Series.str.contains WHOLE WORD
