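Finding strings in pandas

Time: 2020-11-07 20:38:04

Tags: python pandas string

There is a problem in the code I use to select all rows that contain one of the strings in the following list: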

search_query=['great game', 'gran game']
filtered_query=df[(df['Text'].str.lower().str.contains("|", search_query)) | (df['Low_Content'].str.contains("|", search_query))]

filtered_query.drop_duplicates(subset =["User", "Low_Content"], keep = False, inplace = True)

The code above should keep every row that contains at least one of the two strings in the list. Given this data:

User            Text                             Low_Content
432         Great game!I liked it             We played yesterday
34          Good game, man.                    I like this sport
412         We played a GREAT GAME yesterday    Gran game!!!

The code should select only these rows:

  User            Text                             Low_Content
    432         Great game!I liked it             We played yesterday  # it contains Great game in Text
    412         We played a GREAT GAME yesterday    Gran game!!!  # this contains both queries in both columns

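I am not interested in finding "great" or "game" on their own: I want to find the two words together (and the same goes for "gran game").

The code above seems to select a row if it contains one of the two words, rather than one of the two strings.

Thanks for your help.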

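3 answers:

Answer 0: (score: 1)

You are using .str.contains incorrectly. In your code, .str.contains("|", ...) uses the string "|" as the pattern, and the list search_query is never used as the pattern at all. Because .str.contains treats the pattern as a regular expression by default, "|" acts as the OR operator, which matches whatever is on its left or on its right. Here both sides of "|" are empty strings, which is why every row matches: the empty string matches any string.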

Example:

>>> import re
>>> re.search("", "abc")
<re.Match object; span=(0, 0), match=''>
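The same effect can be reproduced with the standard `re` module alone. The sketch below is only illustrative: the sample strings are copied from the Text column in the question, and the variable names are assumptions made for the example.

```python
import re

# Sample strings copied from the Text column in the question.
rows = [
    "Great game!I liked it",
    "Good game, man.",
    "We played a GREAT GAME yesterday",
]

# "|" on its own means "empty string OR empty string", so it matches anything.
matches_bare_pipe = [bool(re.search("|", row)) for row in rows]

# Joining the phrases with "|" gives a pattern that matches either whole phrase.
pattern = "|".join(["great game", "gran game"])
matches_joined = [bool(re.search(pattern, row, flags=re.IGNORECASE)) for row in rows]

print(matches_bare_pipe)  # [True, True, True]
print(matches_joined)     # [True, False, True]
```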

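What you need to do is join search_query into a single string whose elements are separated by "|" (e.g. 'great game|gran game'), so that the pattern checks whether any of those elements occurs in the other string. Finally, pass case=False to .str.contains so that the match is case-insensitive (for example, "great game" will then match "Great game").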

search_elements =['great game', 'gran game']
search_query = "|".join(search_elements)

mask = df["Text"].str.contains(search_query, case=False) | df["Low_Content"].str.contains(search_query, case=False)
subset = df.loc[mask, :]

print(subset)
   User                              Text          Low_Content
0   432             Great game!I liked it  We played yesterday
2   412  We played a GREAT GAME yesterday         Gran game!!!
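One caveat this answer does not cover: because the joined string is interpreted as a regular expression, a phrase containing characters such as `+`, `(` or `.` would be read as regex syntax rather than literal text. A minimal sketch, assuming the phrases are meant to be matched literally, escapes each one with `re.escape` before joining. The phrase "c++ game" and the sample Series are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical search phrases; "c++ game" contains regex metacharacters.
search_elements = ["great game", "c++ game"]

# Escape each phrase so it is matched literally, then join with "|".
search_query = "|".join(re.escape(phrase) for phrase in search_elements)

# Made-up sample data for illustration.
texts = pd.Series(["A C++ game I wrote", "Great game!", "Good game"])
mask = texts.str.contains(search_query, case=False)

print(mask.tolist())  # [True, True, False]
```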

Answer 1: (score: 1)

You are effectively searching the strings for "" or "", so if you replace:

df[(df['Text'].str.lower().str.contains("|", search_query)) | (df['Low_Content'].str.contains("|", search_query))]

with:

df[(df['Text'].str.lower().str.contains("great game|gran game")) | (df['Low_Content'].str.contains("great game|gran game"))]

the problem will be solved.
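Note that in the replacement above only Text is lowercased, so the match on Low_Content stays case-sensitive. The question's sample data hides this, because row 412 already matches through Text. The sketch below uses a made-up row to show the difference, with `case=False` on both columns as one possible variant.

```python
import pandas as pd

# Made-up row where the phrase appears only in Low_Content, capitalised.
df = pd.DataFrame({"Text": ["Nothing relevant"], "Low_Content": ["Gran game!!!"]})
pattern = "great game|gran game"

# As written in the answer: Text is lowercased, Low_Content is case-sensitive.
mask_as_written = (
    df["Text"].str.lower().str.contains(pattern)
    | df["Low_Content"].str.contains(pattern)
)

# Variant: case=False makes the match case-insensitive on both columns.
mask_case_insensitive = (
    df["Text"].str.contains(pattern, case=False)
    | df["Low_Content"].str.contains(pattern, case=False)
)

print(mask_as_written.tolist())        # [False]
print(mask_case_insensitive.tolist())  # [True]
```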

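Answer 2: (score: 1)

I changed the .str.contains() call to .str.contains("|".join(search_query)).
It now searches for 'great game|gran game', which is the correct regular expression for what you are looking for.

Working code example: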

import pandas as pd
from io import StringIO

text = """
User\tText\tLow_Content
432\tGreat game!I liked it\tWe played yesterday
34\tGood game, man.\tI like this sport
412\tWe played a GREAT GAME yesterday\tGran game!!!
"""

df = pd.read_csv(StringIO(text), header=0, sep='\t')

search_query=['great game', 'gran game']

mask = (
    df['Text'].str.contains("|".join(search_query), case=False) 
    | df['Low_Content'].str.contains("|".join(search_query), case=False)
)

df[mask]

Documentation for .str.contains():
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
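If the columns may contain missing values, .str.contains returns a missing value for those rows, which gets in the way of boolean indexing. The sketch below is one possible way to wrap the approach from the answers in a helper that passes `na=False`. The function name `rows_containing_any` and the extra row with a missing Text value are assumptions added for illustration, and the phrases are still treated as a regular expression.

```python
import pandas as pd

def rows_containing_any(df, columns, phrases):
    """Return rows where any of `columns` contains any of `phrases` (case-insensitive)."""
    pattern = "|".join(phrases)
    mask = pd.Series(False, index=df.index)
    for column in columns:
        # na=False treats missing values as "no match" instead of propagating them.
        mask |= df[column].str.contains(pattern, case=False, na=False)
    return df[mask]

# Data from the question, plus one hypothetical row with a missing Text value.
df = pd.DataFrame({
    "User": [432, 34, 412, 7],
    "Text": ["Great game!I liked it", "Good game, man.",
             "We played a GREAT GAME yesterday", None],
    "Low_Content": ["We played yesterday", "I like this sport",
                    "Gran game!!!", "No match here"],
})

result = rows_containing_any(df, ["Text", "Low_Content"], ["great game", "gran game"])
print(result["User"].tolist())  # [432, 412]
```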