Question

Here is my list:

col = [['red', 'yellow', 'blue', 'red', 'green', 'yellow'],
       ['pink', 'orange', 'brown', 'pink', 'brown']
      ]

My goal is to eliminate the items that appear only once in each list.

Here is my code:

eliminate = [[w for w in c if c.count(w) > 1] for c in col]

Output: [['red', 'yellow', 'red', 'yellow'], ['pink', 'brown', 'pink', 'brown']]

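The code works for small datasets such as the list above, but my dataset is very large: each list contains up to 1000 items.

Is there a way to make the code above faster? For example, by breaking it into two or more for loops, since my understanding is that a normal for loop is faster than a list comprehension.

Any suggestions? Thanks.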

Answer 1

I would try an OrderedCounter to avoid the repeated .count() calls:

from collections import OrderedDict, Counter

col=[['red', 'yellow', 'blue', 'red', 'green', 'yellow'],['pink', 'orange', 'brown', 'pink', 'brown']]

class OrderedCounter(Counter, OrderedDict):
    pass

# .items() runs on both Python 2 and 3 (the original used Python 2's .iteritems())
new = [[k for k, v in OrderedCounter(el).items() if v != 1] for el in col]
# [['red', 'yellow'], ['pink', 'brown']]

If we only want to iterate once, then (similar to Martijn's, but using fewer sets):

from itertools import count
def unique_plurals(iterable):
    seen = {}
    return [el for el in iterable if next(seen.setdefault(el, count())) == 1]

# list(...) is needed on Python 3, where map() returns a lazy iterator
new = list(map(unique_plurals, col))

This is more flexible when it comes to specifying how many occurrences are required, and it keeps one dict rather than multiple sets.
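To illustrate the flexibility mentioned above, here is a minimal sketch that turns the required number of occurrences into a parameter. The name `seen_n_times` and the parameter `n` are hypothetical and not part of the original answer.

```python
from itertools import count

def seen_n_times(iterable, n=2):
    """Return each element once, at the point where its n-th occurrence is seen.

    Hypothetical generalisation of unique_plurals; n=2 reproduces it.
    """
    seen = {}
    # count() starts at 0, so the n-th occurrence makes next() return n - 1.
    return [el for el in iterable
            if next(seen.setdefault(el, count())) == n - 1]

print(seen_n_times(['a', 'b', 'a', 'a', 'b'], n=2))  # ['a', 'b']
print(seen_n_times(['a', 'b', 'a', 'a', 'b'], n=3))  # ['a']
```

With n=2 this should behave like the unique_plurals function above; larger values only keep items that repeat more often.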

Answer 2

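Don't use .count(), because it scans every element of the list each time it is called. Moreover, it adds items to the output multiple times if they appear 3 or more times in the input.

You would be better off using a generator function here, one that only yields items it has seen before, and yields each of them only once: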

def unique_plurals(lst):
    seen, seen_twice = set(), set()
    seen_add, seen_twice_add = seen.add, seen_twice.add
    for item in lst:
        if item in seen and item not in seen_twice:
            seen_twice_add(item)
            yield item
            continue
        seen_add(item)

[list(unique_plurals(c)) for c in col]

This iterates over each list only once (unlike using a Counter()).

This approach is much faster:

>>> timeit('[[k for k, v in OrderedCounter(el).iteritems() if v != 1] for el in col]', 'from __main__ import col, OrderedCounter')
52.00807499885559
>>> timeit('[[k for k, v in Counter(el).iteritems() if v != 1] for el in col]', 'from __main__ import col, Counter')
15.766052007675171
>>> timeit('[list(unique_plurals(c)) for c in col]', 'from __main__ import col, unique_plurals')
6.946599006652832
>>> timeit('[list(unique_plurals_dict(c)) for c in col]', 'from __main__ import col, unique_plurals_dict')
6.557853937149048

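That is about 7.5 times faster than the OrderedCounter approach and about 2.3 times faster than the Counter approach.

Jon's single-dictionary-plus-counter approach is faster still, though!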
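The unique_plurals_dict function timed above is not shown in this answer. As a rough guess at what it looks like, assuming it simply wraps Jon's dictionary-of-counters idea in a generator, it could be sketched as follows (this is a reconstruction, not the code that was actually timed):

```python
from collections import defaultdict
from itertools import count

def unique_plurals_dict(lst):
    # Assumed reconstruction: one dict mapping each item to its own counter.
    seen = defaultdict(count)
    for item in lst:
        # next() returns 0 on the first occurrence and 1 on the second,
        # so each repeated item is yielded exactly once.
        if next(seen[item]) == 1:
            yield item

col = [['red', 'yellow', 'blue', 'red', 'green', 'yellow'],
       ['pink', 'orange', 'brown', 'pink', 'brown']]
print([list(unique_plurals_dict(c)) for c in col])
# [['red', 'yellow'], ['pink', 'brown']]
```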

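However, if you only need to eliminate the values that appear exactly once while keeping everything else intact, including repeats, you would use (borrowing from Jon):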

from itertools import count
from collections import defaultdict

def nonunique_plurals(lst):
    seen = defaultdict(count)
    for item in lst:
        cnt = next(seen[item])
        if cnt:
            if cnt == 1:
                # yield twice to make up for skipped first item
                yield item
            yield item

This produces:

>>> [list(nonunique_plurals(c)) for c in col]
[['red', 'red', 'yellow', 'yellow'], ['pink', 'pink', 'brown', 'brown']]
>>> timeit('[non_uniques(c) for c in col]', 'from __main__ import col, non_uniques')
17.75499200820923
>>> timeit('[list(nonunique_plurals(c)) for c in col]', 'from __main__ import col, nonunique_plurals')
9.306739091873169

That is almost twice the speed of the Counter() solution proposed by FMc, but it does not preserve order exactly:

>>> list(nonunique_plurals(['a', 'a', 'b', 'a', 'b', 'c']))
['a', 'a', 'a', 'b', 'b']
>>> non_uniques(['a', 'a', 'b', 'a', 'b', 'c'])
['a', 'a', 'b', 'a', 'b']

Answer 3

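My understanding is that a normal for loop is faster than a list comprehension.

No.

Your loop is slow because it repeats work. For every string in each nested list in col, it counts the number of instances of that string, so for each c in col it performs len(c)**2 comparisons. With N lists of M items each, that is an O(NM^2) algorithm, quadratic in the list length, and it gets slow very quickly.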
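To put rough numbers on that, here is a back-of-the-envelope sketch that counts element visits instead of measuring time. The step counts are a simplified model (one visit per element scanned) and the variable names are made up for illustration.

```python
col = [['red', 'yellow', 'blue', 'red', 'green', 'yellow'],
       ['pink', 'orange', 'brown', 'pink', 'brown']]

# The comprehension calls c.count(w) once per element of c,
# and every call scans the whole of c again.
quadratic_steps = sum(len(c) * len(c) for c in col)

# Counting everything once and then filtering visits each element about twice.
linear_steps = sum(2 * len(c) for c in col)

print(quadratic_steps, linear_steps)  # 61 22

# The same model for a single list of 1000 items:
print(1000 * 1000, 2 * 1000)  # 1000000 2000
```

The gap is small for the example data, but under this model it grows to a factor of about 500 for a list of 1000 items.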

To speed it up, use a better data structure. Use collections.Counter.

Answer 4

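This addresses your revised question: it does make two passes over each inner list (first to count, then to retrieve), so it is not as fast as it could be; however, it preserves order and is very readable. As usual, tradeoffs abound!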

from collections import Counter

cols = [
    ['red', 'yellow', 'blue', 'red', 'green', 'yellow'],
    ['pink', 'orange', 'brown', 'pink', 'brown'],
]

def non_uniques(vals):
    counts = Counter(vals)
    return [v for v in vals if counts[v] > 1]

# list(...) is needed on Python 3, where map() returns a lazy iterator
non_uniqs = list(map(non_uniques, cols))

# [
#    ['red', 'yellow', 'red', 'yellow'],
#    ['pink', 'brown', 'pink', 'brown'],
# ]
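If you want to check this on data closer to your real size, something like the following sketch could be used. The sample data is randomly generated and purely hypothetical, `with_count` is just a wrapper name for the comprehension from the question, and the timings will vary by machine, so no figures are quoted here.

```python
import random
import timeit
from collections import Counter

def with_count(vals):
    # The question's approach: one full scan of vals per element.
    return [v for v in vals if vals.count(v) > 1]

def non_uniques(vals):
    # The approach from this answer: count once, then filter.
    counts = Counter(vals)
    return [v for v in vals if counts[v] > 1]

random.seed(0)  # fixed seed so repeated runs use the same data
# Hypothetical test data: 1000 items drawn from 600 possible values.
sample = [random.randrange(600) for _ in range(1000)]

# Both versions should keep the same items in the same order.
assert with_count(sample) == non_uniques(sample)

for func in (with_count, non_uniques):
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print(func.__name__, round(seconds, 4))
```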

How to speed up a list comprehension

4 answers: