Question

I have a nested list like the following:

test = [['hello', 'hola'], ['hello', 'bonjour', 'hola'], ['hello', 'ciao', 'namaste'], ['hola', 'ciao'], ['hola', 'ciao'], ['namaste', 'bonjour', 'bonjour']]

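I want to keep an element in a sublist only if it appears in at least X sublists (call these shared words) and remove everything else. For this example, if we set X = 3, only the values 'hello', 'hola', and 'ciao' remain in any list, producing: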

shared = [['hello', 'hola'], ['hello', 'hola'], ['hello', 'ciao'], ['hola', 'ciao'], ['hola', 'ciao'], []]

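I would also like another list with the exact inverse logic, keeping only the values that appear in fewer than X sublists, which removes 'hello', 'hola', and 'ciao' from all lists.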

How can I do this? I would put some code here, but as a beginner I got lost trying to write the logic in Python.

Thanks, Jack

Edit: note that 'bonjour' appears 3 times but in only two sublists, so it is not considered shared.

Answer 1

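First, use collections.Counter over the flattened list to count each word, wrapping each sublist in a set so that duplicates within a single sublist are counted only once: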

appearances = Counter(word for sub in arr for word in set(sub))
# Counter({'hola': 4, 'hello': 3, 'ciao': 3, 'bonjour': 2, 'namaste': 2})
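
To see why the set matters, here is a small sketch that compares raw occurrence counts with per-sublist appearance counts. It reuses the `test` list from the question; the name `raw_counts` is introduced here purely for illustration.

```python
from collections import Counter

test = [['hello', 'hola'], ['hello', 'bonjour', 'hola'],
        ['hello', 'ciao', 'namaste'], ['hola', 'ciao'],
        ['hola', 'ciao'], ['namaste', 'bonjour', 'bonjour']]

# Raw occurrences: a word repeated inside one sublist is counted every time.
raw_counts = Counter(word for sub in test for word in sub)

# Sublist appearances: set(sub) collapses duplicates within each sublist,
# so each sublist contributes at most 1 to a word's count.
appearances = Counter(word for sub in test for word in set(sub))

print(raw_counts['bonjour'])   # 3
print(appearances['bonjour'])  # 2
```

With a threshold of 3, the raw count would wrongly treat 'bonjour' as shared, while the set-based count does not.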

Next, use a list comprehension with dictionary lookups (an O(1) operation each) to keep only the words that appear in enough sublists:

[[word for word in sub if appearances[word] >= threshold] for sub in arr]

Put everything together in a simple function that returns the desired result:

from collections import Counter

def threshold_filter(arr, threshold):
  appearances = Counter(word for sub in arr for word in set(sub))

  return [
    [word for word in sub if appearances[word] >= threshold] 
    for sub in arr
  ]

print(threshold_filter(test, 3))

# Result 
[['hello', 'hola'], ['hello', 'hola'], ['hello', 'ciao'], ['hola', 'ciao'], ['hola', 'ciao'], []]
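
The question also asks for a list with the inverse logic (words appearing in fewer than X sublists). A minimal sketch of one way to get both lists from the same counts follows; the function name `split_by_threshold` and the variable `rare` are hypothetical and not part of the original answer.

```python
from collections import Counter

def split_by_threshold(arr, threshold):
    """Return (shared, rare): words in >= threshold sublists, and the rest."""
    # Count in how many sublists each word appears (duplicates collapsed by set).
    appearances = Counter(word for sub in arr for word in set(sub))

    shared = [[w for w in sub if appearances[w] >= threshold] for sub in arr]
    # Inverse condition: keep words found in fewer than `threshold` sublists.
    rare = [[w for w in sub if appearances[w] < threshold] for sub in arr]
    return shared, rare

test = [['hello', 'hola'], ['hello', 'bonjour', 'hola'],
        ['hello', 'ciao', 'namaste'], ['hola', 'ciao'],
        ['hola', 'ciao'], ['namaste', 'bonjour', 'bonjour']]

shared, rare = split_by_threshold(test, 3)
print(rare)
# [[], ['bonjour'], ['namaste'], [], [], ['namaste', 'bonjour', 'bonjour']]
```

Note that duplicates inside a sublist (the two 'bonjour' entries in the last one) are preserved in the output; only the counting step ignores them.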
