Question

I have a pandas Series on which I run a string search like this:

df['column_name'].str.contains('test1')

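This gives me a Series of True/False values, depending on whether the string 'test1' is contained in each value of the column 'column_name'.

However, I can't work out how to test for two strings at once. I need to check that both strings are present, with something like: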

  df['column_name'].str.contains('test1' and 'test2')

This doesn't seem to work. Any suggestions would be great.

Answer 1

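No, you have to create two conditions, combine them with &, and wrap each condition in parentheses because of operator precedence:

(df['column_name'].str.contains('test1')) & (df['column_name'].str.contains('test2'))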

If you want to test for either of the words, the following will work:

df['column_name'].str.contains('test1|test2')
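
To make the difference between the two forms concrete, here is a small sketch. The sample rows are made up for illustration; only the column name comes from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame(
    {'column_name': ['test1 only', 'test2 only', 'test1 and test2', 'neither']}
)

# Both strings must be present in the same row: combine two masks with &.
has_both = df['column_name'].str.contains('test1') & df['column_name'].str.contains('test2')

# Either string may be present: the pattern is a regex, so | means "or".
has_either = df['column_name'].str.contains('test1|test2')

print(has_both.tolist())    # [False, False, True, False]
print(has_either.tolist())  # [True, True, True, False]
```

Either mask can then be used to filter the frame, for example df[has_both].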

Answer 2

all( word in df['column_name'] for word in ['test1', 'test2'] )

This tests for any number of words being present in the string.
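
One caveat worth checking before relying on this: the in operator applied to a pandas Series tests the index labels, not the values. The sketch below, on made-up sample data, shows that behaviour and one possible row-wise alternative that works for any number of words.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({'column_name': ['test1 only', 'test1 and test2', 'neither']})
words = ['test1', 'test2']

# `in` on a Series checks the index labels (here 0, 1, 2), not the values.
print('test1' in df['column_name'])  # False
print(0 in df['column_name'])        # True

# Row-wise alternative: start from an all-True mask and narrow it per word.
mask = pd.Series(True, index=df.index)
for word in words:
    # regex=False treats each word as a literal substring.
    mask = mask & df['column_name'].str.contains(word, regex=False)

print(mask.tolist())  # [False, True, False]
```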

Answer 3

Ignoring the missing quote after 'test2, the 'and' operator is a boolean logical operator. It does not concatenate strings, and it does not do what you think it does.

>>> 'test1' and 'test2'
'test2'
>>> 'test1' or 'test2'
'test1'
>>> 10 and 20
20
>>> 10 and 0
0
>>> 0 or 20
20
>>> # => and so on...

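This happens because the and and or operators act as 'truth deciders' and behave slightly oddly with strings. Essentially, the return value is the last value that was evaluated, whether it is a string or something else. Look at this behaviour: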

>>> a = 'test1'
>>> b = 'test2'
>>> c = a and b
>>> c is a
False
>>> c is b
True

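The latter value is what gets assigned to the variable. What you are looking for is a way to iterate over a list or set of strings and make sure all of them test true. For that we use the all(iterable) function.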

# .str.contains() gives one boolean per row; .any() reduces that to a single bool
if all(df['column_name'].str.contains(s).any() for s in ['test1', 'test2']):
    print("All strings are contained in it.")
else:
    print("Not all strings are contained in it.")

Assuming the strings are present, here is an example of what you would get:

>>> x = [bool(df['column_name'].str.contains(_).any()) for _ in ['test1', 'test2']]
>>> print(x)
[True, True] # => returns True for all()
>>> all(x)
True
>>> x[0] = bool(df['column_name'].str.contains('ThisIsNotInTheColumn').any())
>>> print(x)
[False, True]
>>> all(x)
False

Answer 4

You want to know whether test1 and test2 both appear somewhere in the column.

So use df['col_name'].str.contains('test1').any() & df['col_name'].str.contains('test2').any().
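
Note that this is a column-level check: it is true even when the two strings sit in different rows. A minimal sketch on made-up data, contrasting it with a same-row check, might look like this:

```python
import pandas as pd

# Hypothetical sample data: each string appears, but never in the same row.
df = pd.DataFrame({'col_name': ['has test1', 'has test2', 'neither']})

contains_1 = df['col_name'].str.contains('test1')
contains_2 = df['col_name'].str.contains('test2')

# Column-level: does each string appear somewhere in the column?
in_column = bool(contains_1.any() & contains_2.any())

# Row-level: is there at least one row that contains both strings?
in_same_row = bool((contains_1 & contains_2).any())

print(in_column, in_same_row)  # True False
```

Which of the two is wanted depends on whether the question is about the column as a whole or about individual rows.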
