Question

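I did some research but could not find an answer to the following problem.

How can I do a boolean comparison of a list of substrings against a list of strings?

Here is the code: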

import pandas as pd

string = {'strings_1': ['AEAB', 'AC', 'AI'],
          'strings_2': ['BB', 'BA', 'AG'],
          'strings_3': ['AABD', 'DD', 'PP'],
          'strings_4': ['AV', 'AB', 'BV']}

df_string = pd.DataFrame(data=string)

substring_list = ['AA', 'AE']

for row in df_string.itertuples(index = False):
    combine_row_str = [row[0], row[1], row[2]]

    #below is the main operation
    print(all(substring in row_str for substring in substring_list for row_str in combine_row_str))

The output I get is:

False
False
False

The output I want is:

True
False
False

Answer 1

Here is one way, using pd.DataFrame.sum and a list comprehension:

df = pd.DataFrame(data=string)

lst = ['AA', 'AE']

df['test'] = [all(val in i for val in lst) for i in df.sum(axis=1)]

print(df)

  strings_1 strings_2 strings_3 strings_4   test
0      AEAB        BB      AABD        AV   True
1        AC        BA        DD        AB  False
2        AI        AG        PP        BV  False
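
To make the role of df.sum(axis=1) explicit: on string columns it concatenates each row into a single string, and the check then runs against that string. The sketch below repeats that step in plain Python on the question's data (the names rows, joined, flags and strict are made up for illustration). It also evaluates the question's original expression, which requires every substring to occur in every string of the row and therefore comes out False everywhere.

```python
# Rows of the question's DataFrame, written out as plain lists.
rows = [
    ['AEAB', 'BB', 'AABD', 'AV'],
    ['AC', 'BA', 'DD', 'AB'],
    ['AI', 'AG', 'PP', 'BV'],
]
lst = ['AA', 'AE']

# Same idea as df.sum(axis=1): concatenate each row into one string.
joined = [''.join(row) for row in rows]

# Each substring must occur somewhere in the concatenated row.
flags = [all(val in s for val in lst) for s in joined]
print(flags)

# The question's expression: every substring in every string of the row.
strict = [all(val in s for val in lst for s in row) for row in rows]
print(strict)
```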

Answer 2

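Since you are using pandas, you can call apply row-wise and use str.contains with a regex to find whether the strings match. The first step is to find whether any of the values match the strings in substring_list: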

df_string.apply(lambda x: x.str.contains('|'.join(substring_list)), axis=1)

This returns:

   strings_1  strings_2  strings_3  strings_4
0       True      False       True      False
1      False      False      False      False
2      False      False      False      False

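What is not clear is whether you want to return True only if both substrings are present in a row, or if either one of them is. If either one is enough, you can simply add any() after the contains() call: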

df_string.apply(lambda x: x.str.contains('|'.join(substring_list)).any(), axis=1)

This returns:

0     True
1    False
2    False
dtype: bool
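
One caveat with '|'.join(substring_list): str.contains treats the pattern as a regular expression by default, so a substring containing a character such as '.' or '+' would be read as regex syntax. A minimal sketch, assuming literal matching is what you want, escapes each substring with re.escape first. The data frame df_demo and the list subs are invented for illustration and are not part of the question.

```python
import re
import pandas as pd

# Hypothetical data: 'A.B' should match only a literal dot.
df_demo = pd.DataFrame({'s1': ['A.B', 'AXB'], 's2': ['CC', 'DD']})
subs = ['A.B', 'ZZ']

# Unescaped: '.' matches any character, so 'AXB' matches as well.
raw = df_demo.apply(lambda x: x.str.contains('|'.join(subs)).any(), axis=1)

# Escaped: each substring is matched literally.
pattern = '|'.join(re.escape(s) for s in subs)
escaped = df_demo.apply(lambda x: x.str.contains(pattern).any(), axis=1)

print(raw.tolist())
print(escaped.tolist())
```

With the substrings from the question ('AA', 'AE') escaping changes nothing, since they contain no regex metacharacters.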

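For the second case, jpp provided a one-line solution that concatenates the row elements into a single string. Note, however, that it does not work for corner cases where two consecutive elements are, say, "BBA" and "ABB" and you are trying to match "AA". The concatenated string "BBAABB" will still match "AA", which is wrong. I would like to propose a solution using apply and an extra function, so that the code is more readable: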

def areAllPresent(vals, patterns):
    # True only if every pattern occurs inside at least one single value.
    return all(any(pat in val for val in vals) for pat in patterns)

df_string.apply(lambda x: areAllPresent(x.values, substring_list), axis=1)

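For your example data frame this still returns the same result, but it also works for the cases where both substrings need to be matched: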

0     True
1    False
2    False
dtype: bool
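
The corner case described above can be checked directly. This is only a sketch in plain Python: the row ['BBA', 'ABB'] is the hypothetical example from the text, and are_all_present is an illustrative restatement of the helper, not a pandas function.

```python
def are_all_present(vals, patterns):
    # True when every pattern occurs inside at least one single value.
    return all(any(pat in val for val in vals) for pat in patterns)

# Hypothetical row from the corner case: neither string contains 'AA'.
row = ['BBA', 'ABB']
patterns = ['AA']

# Concatenating first creates 'AA' across the boundary between elements.
concatenated = all(pat in ''.join(row) for pat in patterns)

# Checking each element separately avoids the false match.
per_element = are_all_present(row, patterns)

print(concatenated)
print(per_element)
```

The concatenated check reports a match that exists only across the element boundary, while the per-element check does not.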

Boolean comparison of a list of substrings against a list of strings