There is a problem in my code, which is meant to select all rows that contain one of the strings in the following list:
search_query=['great game', 'gran game']
filtered_query=df[(df['Text'].str.lower().str.contains("|", search_query)) | (df['Low_Content'].str.contains("|", search_query))]
filtered_query.drop_duplicates(subset =["User", "Low_Content"], keep = False, inplace = True)
Given the data below, the code above should keep every row that contains at least one of the two strings in the list:
User  Text                              Low_Content
432   Great game!I liked it             We played yesterday
34    Good game, man.                   I like this sport
412   We played a GREAT GAME yesterday  Gran game!!!
The code should select only these rows:
User  Text                              Low_Content
432   Great game!I liked it             We played yesterday   # contains "Great game" in Text
412   We played a GREAT GAME yesterday  Gran game!!!          # contains both queries, one in each column
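I am not interested in finding "great" or "game" on their own: I want to match the two words together (and likewise for "gran game").
The code above, however, seems to select rows that contain either of the two words rather than either of the two strings.
Thanks for your help.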
Answer 0 (score: 1)
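You are using .str.contains incorrectly. With .str.contains("|", ...), the pattern being searched for is the string "|"; search_query is never used as the pattern, because it is passed as the second positional argument (case). Since .str.contains treats its pattern as a regular expression by default, "|" acts as the OR operator: it matches whatever is on its left or on its right. Here both sides of "|" are empty strings, which is why every row matches (an empty pattern matches any string).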
Example:
>>> import re
>>> re.search("", "abc")
<re.Match object; span=(0, 0), match=''>
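The same effect can be reproduced in pandas. The following is a minimal sketch, assuming a small Series whose values are made up purely for illustration; it checks that the bare pattern "|" flags every row.

```python
import re
import pandas as pd

# Hypothetical sample values, used only for illustration.
s = pd.Series(["Great game!", "I like this sport", "abc"])

# "|" is an alternation of two empty patterns, so it matches at the start of any string.
print(re.search("|", "abc") is not None)  # True

# Used with .str.contains, the same pattern therefore marks every row as a match.
all_match = s.str.contains("|")
print(all_match.tolist())  # [True, True, True]
```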
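What you need to do is join the elements of search_query into a single string, separated by "|" (for example 'great game|gran game'), so that the pattern checks whether any of those elements occurs in the other string. Finally, pass case=False to .str.contains so that the match is case-insensitive (for example, "great game" will then match "Great game").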
search_elements =['great game', 'gran game']
search_query = "|".join(search_elements)
mask = df["Text"].str.contains(search_query, case=False) | df["Low_Content"].str.contains(search_query, case=False)
subset = df.loc[mask, :]
print(subset)
   User                              Text          Low_Content
0   432             Great game!I liked it  We played yesterday
2   412  We played a GREAT GAME yesterday         Gran game!!!
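One caveat, offered as a hedged aside: joining with "|" means each search term is interpreted as a regular expression. If a term could contain characters such as ".", "+" or "(", escaping each term with re.escape before joining keeps the match literal. The sketch below assumes a hypothetical term list and sample column; neither comes from the question.

```python
import re
import pandas as pd

# Hypothetical data and terms, chosen so that one term contains regex metacharacters.
df = pd.DataFrame({"Text": ["Great game (really)", "Great game really", "Good game"]})
search_elements = ["game (really)", "gran game"]

# Escape each term so that "(" and ")" are matched literally, then join with "|".
search_query = "|".join(re.escape(term) for term in search_elements)

mask = df["Text"].str.contains(search_query, case=False)
subset = df.loc[mask, :]
print(subset)
```

Without the escaping step, "game (really)" would be read as a pattern with a capture group and would match "game really" instead of the literal text with parentheses.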
Answer 1 (score: 1)
With the pattern "|" you are searching the string for "" or "" (two empty strings), so if you replace:
df[(df['Text'].str.lower().str.contains("|", search_query)) | (df['Low_Content'].str.contains("|", search_query))]
with:
df[(df['Text'].str.lower().str.contains("great game|gran game")) | (df['Low_Content'].str.contains("great game|gran game"))]
the problem will be solved.
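A note on case, as a hedged sketch rather than a correction: in the replacement above only Text is lowercased, so the Low_Content check stays case-sensitive and would not match "Gran game!!!" by itself. The sample row 412 is still selected because its Text column matches. The snippet below rebuilds the question's sample data (the DataFrame construction is an assumption about how the data is laid out) and compares the two behaviours on Low_Content.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "User": [432, 34, 412],
    "Text": ["Great game!I liked it", "Good game, man.", "We played a GREAT GAME yesterday"],
    "Low_Content": ["We played yesterday", "I like this sport", "Gran game!!!"],
})
pattern = "great game|gran game"

# Case-sensitive check: "Gran game!!!" does not contain lowercase "gran game".
sensitive = df["Low_Content"].str.contains(pattern)
# Case-insensitive check: the same row now matches.
insensitive = df["Low_Content"].str.contains(pattern, case=False)

print(sensitive.tolist())    # [False, False, False]
print(insensitive.tolist())  # [False, False, True]
```

Passing case=False on both columns avoids depending on which column happens to hold the match.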
Answer 2 (score: 1)
I changed the .str.contains() call to .str.contains("|".join(search_query)).
It now searches for 'great game|gran game', which is the regular expression you are looking for.
Working code example:
import pandas as pd
from io import StringIO
text = """
User\tText\tLow_Content
432\tGreat game!I liked it\tWe played yesterday
34\tGood game, man.\tI like this sport
412\tWe played a GREAT GAME yesterday\tGran game!!!
"""
df = pd.read_csv(StringIO(text), header=0, sep='\t')
search_query=['great game', 'gran game']
mask = (
df['Text'].str.contains("|".join(search_query), case=False)
| df['Low_Content'].str.contains("|".join(search_query), case=False)
)
df[mask]
Documentation for .str.contains():
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html