Question

I have a list of length 370000. In this list I have items like "a", "y", "Y", "q", "Q", "p", "P", meaning it is a list of words, but from time to time I get these single characters.

I want to remove these characters from the list. I am pretty new to Python, but the first thing that came to mind was to do something like:

for word in words:
    if word== 'm' or  word== 'y' or word== 'Y' or word== 'p' or word== 'Q' or word== 'q' or word== 'a' or word== 'uh':
        words.remove(word)

On a list with 370,000 items this approach takes a long time. Seriously, a lot.

Does anyone have a better idea for how to get better performance?

Thanks in advance.

Answer 1

Tried a few bogo-benchmarks in IPython.

import random
# Don't know how to generate your words, use integers as substitute.
words = [random.randint(0, 25) for i in xrange(370000)]
badlist = range(7)
badtuple = tuple(badlist)
badset = set(badlist)
# List comprehension
%timeit [w for w in words if w not in badlist]
10 loops, best of 3: 59.2 ms per loop
%timeit [w for w in words if w not in badtuple]
10 loops, best of 3: 64.7 ms per loop
%timeit [w for w in words if w not in badset]
10 loops, best of 3: 30.3 ms per loop
# Filter
%timeit filter(lambda w: w not in badlist, words)
10 loops, best of 3: 85.6 ms per loop
%timeit filter(lambda w: w not in badtuple, words)
10 loops, best of 3: 92.1 ms per loop
%timeit filter(lambda w: w not in badset, words)
10 loops, best of 3: 50.8 ms per loop

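Conclusion: a list comprehension with not in <set> as the filter condition is probably the best.

But as I said, this benchmark is bogus; you need to repeat some benchmarks on the actual kind of data you will encounter to see which is better.

Some thoughts on why list comprehension + "not in set" is probably optimal.

filter vs list comprehension: filter actually calls the callable it is given, and a call in Python has its own overhead (creating a stack frame, etc.). ~~filter also tries to be smart and return the correct type, which adds overhead.~~ (This is actually negligible.) In contrast, a list comprehension's condition check (the if ... clause) costs less than a call: it is just expression evaluation, without the full machinery of a Python call stack.
Set membership testing is O(1) on average and O(n) in the worst case, whereas list/tuple membership is always O(n).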
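As a rough Python 3 re-creation of the comparison above (a sketch, not the original benchmark): the word list here is synthetic, and the names `make_words`, `bad_list` and `bad_set` are made up for illustration. Both filters produce the same result; only the cost of the membership test differs.

```python
import random
import timeit

bad_list = ["a", "y", "Y", "q", "Q", "p", "P", "uh"]
bad_set = set(bad_list)  # built once, outside the comprehension


def make_words(n, seed=0):
    # Synthetic stand-in for the real word list (an assumption).
    rng = random.Random(seed)
    pool = bad_list + ["apple", "word", "python", "list", "remove"]
    return [rng.choice(pool) for _ in range(n)]


words = make_words(370000)

# Same filter, two different containers for the membership test.
filtered_list = [w for w in words if w not in bad_list]
filtered_set = [w for w in words if w not in bad_set]

if __name__ == "__main__":
    for name, container in (("list", bad_list), ("set", bad_set)):
        seconds = timeit.timeit(
            lambda: [w for w in words if w not in container], number=5
        )
        print(f"not in {name}: {seconds / 5 * 1000:.1f} ms per loop")
```

Timings depend on the machine and on the data, so treat the printed numbers as indicative only.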

Answer 2

You can use a list comprehension, for example:

words = [word for word in words if word not in ["a", "y", "Y", "q", "Q", "p", "P", "uh"]]

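List comprehensions tend to give much better performance.

Edit (thanks to Cong Ma's results):

It seems the best performance comes from using a set as the filter sequence, so you want something more like: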

bad_words = {"a", "y", "Y", "q", "Q", "p", "P", "uh"}  # build the set once, not once per word
words = [word for word in words if word not in bad_words]

Answer 3

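"but from time to time I get these single characters."

I think the logic here is poor. You should drop the word at the point where you insert it into the list. Removing it afterwards from a lengthy list is a bad choice.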
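A minimal sketch of filtering at insertion time, assuming the list is built one word at a time. The helper `add_word` and the constant `BAD_WORDS` are hypothetical names, not something from the question.

```python
# Hypothetical set of unwanted tokens, taken from the question's examples.
BAD_WORDS = frozenset(["a", "y", "Y", "q", "Q", "p", "P", "uh"])


def add_word(words, word):
    """Append word to words unless it is one of the unwanted tokens."""
    if word not in BAD_WORDS:
        words.append(word)


words = []
for token in ["hello", "a", "world", "uh", "Python", "Q"]:
    add_word(words, token)

print(words)  # ['hello', 'world', 'Python']
```

This way the unwanted items never enter the list, so no later cleanup pass is needed.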

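I ran into the same problem, and at first my solution was to use pypy.

I think pypy had a problem at the time (my code exited abruptly), so I changed the code to use better logic and ran it with ordinary python.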

Answer 4

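Try a generator pipeline; this one has a simple application. Generators perform well and often reduce memory usage, because a pipeline does not create large temporary lists (although my final list violates this principle).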

bad_list = ["a", "y", "Y", "q", "Q", "p", "P", "uh"]

# lines read from a file keep their trailing newline, so strip them first
stripped = (line.strip() for line in open("lexicon"))

# here is the generator "comprehension"
good_word_stream = (word for word in stripped if word not in bad_list)

# ... and use the output for something
print([len(word) for word in good_word_stream])

Answer 5

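When you have enough memory, modifying a list in place while looping over it is not a good idea; it is easy to get wrong, as the comments said.

As for performance, list.remove is an O(n) operation, so your code is O(N^2).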
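A small illustration of both points, using a made-up six-item list rather than the real data: removing items while iterating skips the element that slides into the removed slot, and every `remove` call rescans the list.

```python
bad = {"a", "y", "uh"}
original = ["a", "y", "hello", "uh", "uh", "world"]

# Buggy: mutating the list while iterating over it.
buggy = list(original)
for word in buggy:
    if word in bad:
        buggy.remove(word)  # O(n) scan and shift on every call

# Safe: build a new list in a single pass.
clean = [word for word in original if word not in bad]

print(buggy)  # ['y', 'hello', 'uh', 'world']
print(clean)  # ['hello', 'world']
```

The buggy loop leaves "y" and one "uh" behind because each removal shifts the following element into a position the loop has already passed.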

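A list comprehension is much faster because it trades space for time: it creates a new list (or, in Python 3, you can use a generator) and uses a small blacklist to filter down to the final result. I am not sure whether it creates ["a", "y", "Y", "q", "Q", "p", "P", "uh"] every time, but Cong Ma's deleted answer mentions that creating this small set first (yes, a set; in set() is an O(1) operation!) may help performance.

Also, in my earlier tests a list comprehension was about 25% slower than map or list(map(something)). I can't prove it right now, but you may want to test it.

Pypy / Cython would be the last resort, if you have done everything you can in plain Python and performance still does not meet production requirements.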

Answer 6

translate seems to be about 1.5 times faster than a list comprehension.
Tested over 10000 runs.

def remove_chars(string_, word_):
    # 10000 0.112017
    string_ += string_.upper()
    vowels_table = dict.fromkeys(map(ord, string_))
    return word_.translate(vowels_table)


def remove_chars2(string_,word_):
    # 10000 0.166002
    return [c for c in word_ if not c in string_]
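For reference, a Python 3 sketch of how the translate approach behaves, assuming the goal is to delete characters from inside a single string (a different task from dropping whole items from a list). The function `remove_chars_join` is a hypothetical comparison helper that returns a string, unlike `remove_chars2` above, which returns a list of characters.

```python
def remove_chars(chars, word):
    # Map each code point (lower and upper case) to None;
    # str.translate deletes characters that map to None.
    table = dict.fromkeys(map(ord, chars + chars.upper()))
    return word.translate(table)


def remove_chars_join(chars, word):
    # Equivalent result built with a generator expression and str.join.
    unwanted = set(chars + chars.upper())
    return "".join(c for c in word if c not in unwanted)


result = remove_chars("aeiou", "Performance Of Lists")
print(result)  # Prfrmnc f Lsts
```

Both helpers give the same string here; which one is faster on real data is worth measuring rather than assuming.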

Python performance: removing items from a list

6 answers: