df (a Pandas DataFrame) has three rows.
col_name
"This is Donald."
"His hands are so small"
"Why are his fingers so short?"
I want to extract the rows that contain "is" and "small".
If I do
df.col_name.str.contains("is|small", case=False)
then it also catches "His", which I don't want.
Is the following query the right way to capture whole words in df.series?
df.col_name.str.contains("\bis\b|\bsmall\b", case=False)
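Answer 0 (score: 6)
No, the regex
/bis/b|/bsmall/b
will fail because you are using /b rather than \b, which means "word boundary".
Change that and you get a match. I suggest using
\b
as I said; the regex is then a little faster and a little clearer.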
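As a minimal sketch of the word-boundary approach on the question's sample data: the DataFrame is rebuilt from the question, while the raw-string prefix and the non-capturing group (?:...) are assumptions added here for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"col_name": ["This is Donald.",
                                "His hands are so small",
                                "Why are his fingers so short?"]})

# r"..." keeps \b as a regex word boundary instead of a backspace character.
# (?:...) groups the alternatives without creating a capture group.
mask = df.col_name.str.contains(r"\b(?:is|small)\b", case=False)
print(df[mask])
```

Only the first two rows are selected: "His" and "his" no longer count, because neither contains "is" as a whole word.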
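Answer 1 (score: 0)
Your way (with /b) didn't work for me. I'm not sure why you can't use the logical operator and (&), since I think that is what you actually want.
This is a silly way to do it, but it works: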
mask = lambda x: ("is" in x) & ("small" in x)
series_name.apply(mask)
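Note that the `in` test checks substrings, so "His" still counts as containing "is". A hedged sketch of how the same AND logic might be combined with whole-word matching follows; the Series `s` and the names `substring` and `both` are hypothetical stand-ins for illustration.

```python
import pandas as pd

s = pd.Series(["This is Donald.",
               "His hands are so small",
               "Why are his fingers so short?"])

# Substring test, as in the lambda above: "His" counts as containing "is".
substring = s.apply(lambda x: ("is" in x) and ("small" in x))

# Whole-word test: build one boolean mask per word and combine them with &.
both = (s.str.contains(r"\bis\b", case=False)
        & s.str.contains(r"\bsmall\b", case=False))

print(substring.tolist())
print(both.tolist())
```

On this data the substring version selects the second row, while the whole-word version selects none, since no row contains both "is" and "small" as separate words.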
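Answer 2 (score: 0)
First, you may want to convert everything to lowercase, remove punctuation and whitespace, and then convert the result into a set of words.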
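import string
# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern literally by default.
df['words'] = [set(words) for words in
               df['col_name']
               .str.lower()
               .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
               .str.strip()
               .str.split()
               ]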
>>> df
col_name words
0 This is Donald. {this, is, donald}
1 His hands are so small {small, his, so, are, hands}
2 Why are his fingers so short? {short, fingers, his, so, are, why}
You can now use boolean indexing to check whether all of the target words are in these new word sets.
target_words = ['is', 'small']
# Convert target words to lower case just to be safe.
target_words = [word.lower() for word in target_words]
df['match'] = df.words.apply(lambda words: all(target_word in words
                                               for target_word in target_words))
print(df)
# Output:
# col_name words match
# 0 This is Donald. {this, is, donald} False
# 1 His hands are so small {small, his, so, are, hands} False
# 2 Why are his fingers so short? {short, fingers, his, so, are, why} False
target_words = ['so', 'small']
target_words = [word.lower() for word in target_words]
df['match'] = df.words.apply(lambda words: all(target_word in words
                                               for target_word in target_words))
print(df)
# Output:
# col_name words match
# 0 This is Donald. {this, is, donald} False
# 1 His hands are so small {small, his, so, are, hands} True
# 2 Why are his fingers so short? {short, fingers, his, so, are, why} False
Extract the matching rows:
>>> df.loc[df.match, 'col_name']
# Output:
# 1 His hands are so small
# Name: col_name, dtype: object
Using boolean indexing, all of this can be turned into a single statement:
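# regex=True is needed on pandas >= 2.0, where str.replace treats the pattern literally by default.
df.loc[[all(target_word in word_set for target_word in target_words)
        for word_set in (set(words) for words in
                         df['col_name']
                         .str.lower()
                         .str.replace('[{0}]*'.format(string.punctuation), '', regex=True)
                         .str.strip()
                         .str.split())], :]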
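Answer 3 (score: 0)
In
"\bis\b|\bsmall\b"
the backslash sequence \b is parsed as the ASCII backspace character before the string is even passed to the regex method for matching/searching. For more information, check this document about escape characters. That document states:
When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.
So either use the r prefix:
df.col_name.str.contains(r"\bis\b|\bsmall\b", case=False)
or escape the \ character:
df.col_name.str.contains("\\bis\\b|\\bsmall\\b", case=False)
If you would like to see an example, here is the Fiddle.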
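The escape behaviour can be checked in plain Python without pandas. The following is a small sketch; the variable names `plain`, `raw`, `escaped` and `text` are hypothetical and chosen only for illustration.

```python
import re

text = "This is Donald."

plain = "\bis\b"        # \b is converted to the backspace character (0x08)
raw = r"\bis\b"         # raw string: the backslashes survive
escaped = "\\bis\\b"    # doubled backslashes: same string as the raw version

print(plain == "\x08is\x08")    # the pattern really contains backspaces
print(raw == escaped)           # both spellings give the same regex
print(re.search(plain, text))   # no backspace in the text, so no match
print(re.search(raw, text))     # matches the standalone word "is"
```

The non-raw pattern looks for literal backspace characters around "is", which is why it never matches ordinary text.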
Answer 4 (score: 0)
As an extension of this discussion, I would like to use a variable inside the regex, as follows:
df = df_w[df_w['Country/Region'].str.match("\b(location.loc[i]['country'])\b",case=False)]
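If I leave out the \b ... \b, the code returns all the columns for both Sudan and South Sudan. When I use "\b(location.loc[i]['country'])\b", it returns an empty dataframe. Please tell me the correct usage.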
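One likely cause is that the variable name sits inside the string literal, so it is never evaluated, and the non-raw "\b" becomes a backspace. A hedged sketch of one possible fix uses a raw f-string. The small `df_w` frame and the `country` variable below are hypothetical stand-ins for the poster's data and for location.loc[i]['country'].

```python
import re
import pandas as pd

# Hypothetical stand-ins for the poster's df_w and location.loc[i]['country'].
df_w = pd.DataFrame({"Country/Region": ["Sudan", "South Sudan", "Chad"]})
country = "Sudan"

# Raw f-string: \b stays a regex word boundary and {...} inserts the variable.
# re.escape guards against regex metacharacters in the country name.
pattern = rf"\b{re.escape(country)}\b"

# str.match anchors at the start of each string, so "South Sudan" is excluded.
result = df_w[df_w["Country/Region"].str.match(pattern, case=False)]
print(result)
```

With str.contains instead of str.match, "South Sudan" would still be selected, because "Sudan" appears there as a whole word.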