Question

Say I have a list of strings, for example:

listStrings = [ 'cat', 'bat', 'hat', 'dad', 'look', 'ball', 'hero', 'up']

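Is there a way to return all rows where a particular column contains 3 or more of the strings in the list?

For example, if the column contains "My dad is a hero for saving the cat", then that row would be returned.

But if the column only contains "the cat and bat teamed up to find some food", that row would not be returned.

The only way I can think of is to take every combination of 3 strings from the list and join them with AND conditions, for example 'cat' AND 'bat' AND 'hat'.

However, that seems neither computationally efficient nor Pythonic.

Is there a more efficient, more compact way to do this?

Edit

Here is a pandas example: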

import pandas as pd 

listStrings = [ 'cat', 'bat', 'hat', 'dad', 'look', 'ball', 'hero', 'up']

df = pd.DataFrame(['test1', 'test2', 'test3'], ['My dad is a hero for saving the cat', 'the cat and bat teamed up to find some food', 'The dog found a bowl'])
df.head()


                                                 0
My dad is a hero for saving the cat          test1
the cat and bat teamed up to find some food  test2
The dog found a bowl                         test3

So, using listStrings, I would like to return row 1 but not rows 2 or 3.

Answer 1

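You can build a set from the list of strings. Write a function that takes a row and checks whether each word is in the set. Each time a word is found in the set, increment a counter by 1. If the counter reaches 3, return True. If you finish checking the row and the counter is still below 3, return False.

Apply this function to the rows: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

This takes O(n) space for the set and O(m) time per row (since each set lookup is O(1)), where n is the number of strings in the list and m is the number of words in the row.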
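
The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name has_enough_matches, its threshold parameter, and the text column are hypothetical (the question's frame keeps the sentences in the index instead), and words are split on whitespace and matched case-sensitively.

```python
import pandas as pd

listStrings = ['cat', 'bat', 'hat', 'dad', 'look', 'ball', 'hero', 'up']
wordSet = set(listStrings)  # O(1) average-time membership checks


def has_enough_matches(text, words=wordSet, threshold=3):
    """Return True once `threshold` tokens of `text` are found in `words`.

    Hypothetical helper: the name and parameters are not from the answer.
    """
    count = 0
    for token in text.split():
        if token in words:
            count += 1
            if count >= threshold:
                return True  # stop early, no need to scan the rest
    return False


# Assumption: sentences live in a column named 'text'
df = pd.DataFrame({'text': ['My dad is a hero for saving the cat',
                            'the cat and bat teamed up to find some food',
                            'The dog found a bowl']})

# Keep rows 0 and 1: 'dad', 'hero', 'cat' and 'cat', 'bat', 'up'
result = df[df['text'].apply(has_enough_matches)]
print(result)
```

As described, the counter counts every matching token, so a repeated word such as 'cat cat cat' would also reach 3. Collecting the matches into a set instead would count distinct words only.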

Answer 2

You can use set intersection:

import pandas as pd 

listStrings =  {'A', 'B'}    
df = pd.DataFrame({'text': ['A B', 'B C', 'C D']})

df = df.loc[df.text.apply(lambda x: len(listStrings.intersection(x.split())) >= 2)]
print(df)

Output:

  text
0  A B
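
The same idea, adapted as a sketch to the question's word list with a threshold of 3. The sample sentences, the text column, the min_matches variable, and the lowercasing step are assumptions for illustration. Set intersection counts distinct words only.

```python
import pandas as pd

listStrings = {'cat', 'bat', 'hat', 'dad', 'look', 'ball', 'hero', 'up'}
min_matches = 3  # hypothetical name for the required number of matches

# Assumed sample data, with the sentences in a 'text' column
df = pd.DataFrame({'text': ['My dad is a hero for saving the cat',
                            'The cat sat on the hat',
                            'The dog found a bowl']})

# Count the distinct list words in each row; lowercase first so that
# 'Cat' would also match 'cat'
match_counts = df['text'].apply(
    lambda s: len(listStrings.intersection(s.lower().split())))

filtered = df.loc[match_counts >= min_matches]
print(filtered)
```

Keeping the counts in a separate Series makes it easy to change the threshold or inspect how many words each row matched.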

Answer 3

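You can build a dataframe with the sentences as columns and the words from listStrings as the index, where a value is 1 if the word is in the sentence and 0 otherwise.

Summing that dataframe gives a Series with the same index as the example dataframe, whose values are the number of listed words found in each sentence. You can use it to select the rows having at least a given number of them.

Possible code: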

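# One column per sentence (taken from df.index), one row per word in listStrings:
# 1 if the word occurs in the sentence, else 0. Summing counts matches per sentence.
resul = pd.DataFrame({ix:
                      [1 if word in ix.split() else 0 for word in listStrings]
                      for ix in df.index}).sum()
print(df[resul>=3])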

It gives:

                                                 0
My dad is a hero for saving the cat          test1
the cat and bat teamed up to find some food  test2

Pandas: how to return all rows if a column string contains at least a certain number of strings from a list?
