Question

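I need to filter a large list several times, but I care about both the simplicity of the code and its execution efficiency. For example: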

all_things # huge collection of all things

# inefficient but clean code
def get_clothes():
    return filter(lambda t: t.garment, all_things)

def get_hats():
    return filter(lambda t: t.headgear, get_clothes())

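I'm worried that I'm iterating over the clothes list when it has, in fact, already been iterated over. I'd also like to keep the two filter operations separate, since they belong to two different classes, and I don't want to duplicate the first lambda function in the hats class.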

# efficient but duplication of code
def get_clothes():
    return filter(lambda t: t.garment, all_things)

def get_hats():
    return filter(lambda t: t.headgear and t.garment, all_things)

I have been looking into generator functions, since they seem like the way to go, but I haven't yet figured out how.

Answer 1

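First, the filter/lambda combination is discouraged in current Python style (dropping it was proposed for Python 3, although both remain in the language). The preferred functional programming style is described in the Python Functional Programming HOWTO.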

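Second, if you are concerned about efficiency, you should return generators rather than build lists. In this case they are simple enough that you can use generator expressions.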

def get_clothes():
    return (t for t in all_things if t.garment)

def get_hats():
    return (t for t in get_clothes() if t.headgear)

Or, if you prefer, true generators (allegedly more Pythonic):

def get_clothes():
    for t in all_things:
        if t.garment:
            yield t

def get_hats():
    for t in get_clothes():
        if t.headgear:
            yield t

If for some reason you sometimes need a list rather than an iterator, you can build the list with a simple conversion:

hats_list = list(get_hats())

Note that the above does not build the clothes list, so its efficiency is close to that of your duplicated-code version.
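To make the "no intermediate clothes list" point concrete, here is a minimal sketch. The Thing class, the sample data, and the `seen` list are assumptions invented for this illustration; they are not part of the question.

```python
class Thing(object):
    """Hypothetical stand-in for the asker's objects."""
    def __init__(self, garment, headgear):
        self.garment = garment
        self.headgear = headgear

# Assumed sample data: every 2nd thing is a garment, every 4th is headgear.
all_things = [Thing(i % 2 == 0, i % 4 == 0) for i in range(12)]

seen = []  # records which items get_clothes() has actually inspected

def get_clothes():
    for t in all_things:
        seen.append(t)
        if t.garment:
            yield t

def get_hats():
    return (t for t in get_clothes() if t.headgear)

hats = get_hats()
assert len(seen) == 0              # creating the chain iterates nothing

first_hat = next(hats)
assert len(seen) == 1              # only one item was pulled through

# Draining the rest walks all_things exactly once in total.
hats_list = [first_hat] + list(hats)
assert len(seen) == len(all_things)
```

Each item flows through both filters before the next one is read, so the only list that is ever built is the final hats_list.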

Answer 2

I was looking for similar list filtering, but wanted a slightly different format from what is presented here.

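The get_hats() call above is fine, but its reuse is limited. I was looking for something more like get_hats(get_clothes(all_things)), where you specify the source (all_things) and then apply as few or as many levels of filters, get_hats(), get_clothes(), as you want.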

I found a way to do this with generators:

def get_clothes(in_list):
    for item in in_list:
        if item.garment:
            yield item

def get_hats(in_list):
    for item in in_list:
        if item.headgear:
            yield item

It can then be called like this:

get_hats(get_clothes(all_things))
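If many such filters are needed, the same idea can be generalized. The following is only a sketch: the helper names where() and pipeline(), the Thing class, and the sample data are hypothetical and not part of the answer above.

```python
from functools import reduce  # in functools on Python 2.6+ and Python 3

def where(attr):
    """Build a filter that lazily keeps items whose `attr` is truthy."""
    def keep(items):
        return (item for item in items if getattr(item, attr))
    return keep

def pipeline(source, *filters):
    """Feed `source` through each filter in turn; nothing runs until iterated."""
    return reduce(lambda items, f: f(items), filters, source)

class Thing(object):
    """Hypothetical stand-in for the asker's objects."""
    def __init__(self, garment, headgear):
        self.garment = garment
        self.headgear = headgear

# Assumed sample data: every 2nd thing is a garment, every 4th is headgear.
all_things = [Thing(i % 2 == 0, i % 4 == 0) for i in range(12)]

get_clothes = where('garment')
get_hats = where('headgear')

# Equivalent to get_hats(get_clothes(all_things)), with the filters listed in order.
hats = list(pipeline(all_things, get_clothes, get_hats))
```

Because each filter only takes an iterable and returns an iterable, the filters can be applied in any order or to any source.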

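I tested the original solutions, vartec's solution, and this additional solution for efficiency, and was somewhat surprised by the results. The code is as follows: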

Setup:

class Thing:
    def __init__(self):
        self.garment = False
        self.headgear = False

all_things = [Thing() for i in range(1000000)]

for i, thing in enumerate(all_things):
    if i % 2 == 0:
        thing.garment = True
    if i % 4 == 0:
        thing.headgear = True

Original solutions:

def get_clothes():
    return filter(lambda t: t.garment, all_things)

def get_hats():
    return filter(lambda t: t.headgear, get_clothes())

def get_clothes2():
    return filter(lambda t: t.garment, all_things)

def get_hats2():
    return filter(lambda t: t.headgear and t.garment, all_things)

My solution:

def get_clothes3(in_list):
    for item in in_list:
        if item.garment:
            yield item

def get_hats3(in_list):
    for item in in_list:
        if item.headgear:
            yield item

vartec's solution:

def get_clothes4():
    for t in all_things:
        if t.garment:
            yield t

def get_hats4():
    for t in get_clothes4():
        if t.headgear:
            yield t

Timing code:

import timeit

print 'get_hats()'
print timeit.timeit('get_hats()', 'from __main__ import get_hats', number=1000)

print 'get_hats2()'
print timeit.timeit('get_hats2()', 'from __main__ import get_hats2', number=1000)

print '[x for x in get_hats3(get_clothes3(all_things))]'
print timeit.timeit('[x for x in get_hats3(get_clothes3(all_things))]',
                    'from __main__ import get_hats3, get_clothes3, all_things',
                    number=1000)

print '[x for x in get_hats4()]'
print timeit.timeit('[x for x in get_hats4()]',
                    'from __main__ import get_hats4', number=1000)

Results:

get_hats()
379.334653854
get_hats2()
232.768362999
[x for x in get_hats3(get_clothes3(all_things))]
214.376812935
[x for x in get_hats4()]
218.250688076

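The generator versions appear to be slightly faster; the difference in time between my solution and vartec's is probably just noise. But I prefer the flexibility of being able to apply whatever filters are needed, in any order.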
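Before comparing timings, it is worth checking that the variants return the same hats. The sketch below assumes a smaller data set and made-up function names so that it runs quickly. It also wraps filter() in list(): on Python 2.7 filter() already returns a list, but on Python 3 it returns a lazy iterator, so timing a bare call there would measure almost no work.

```python
import timeit

class Thing(object):
    """Hypothetical stand-in for the asker's objects."""
    def __init__(self, garment, headgear):
        self.garment = garment
        self.headgear = headgear

# Assumed, smaller data set than the one million items used above.
all_things = [Thing(i % 2 == 0, i % 4 == 0) for i in range(10000)]

def hats_nested_filter():
    clothes = filter(lambda t: t.garment, all_things)
    return list(filter(lambda t: t.headgear, clothes))

def hats_combined_filter():
    return list(filter(lambda t: t.headgear and t.garment, all_things))

def hats_generators():
    clothes = (t for t in all_things if t.garment)
    return [t for t in clothes if t.headgear]

candidates = [hats_nested_filter, hats_combined_filter, hats_generators]

# All variants must agree before their timings are worth comparing.
expected = hats_combined_filter()
assert all(f() == expected for f in candidates)

timings = {}
for f in candidates:
    # timeit.timeit accepts a callable; number is kept small on purpose.
    timings[f.__name__] = timeit.timeit(f, number=20)

for name, seconds in sorted(timings.items()):
    print('%s: %.4f s' % (name, seconds))
```

The absolute numbers depend on the machine and the Python version, so only the relative ordering on a single run is meaningful.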

Answer 3

To do it in a single pass:

clothes = []
hats = []
for thing in all_things:
    if thing.garment:
        clothes.append(thing)
        # hats are a subset of clothes, so only garments need this check
        if thing.headgear:
            hats.append(thing)

To do it in one large pass and one smaller pass (list comprehensions):

clothes = [x for x in all_things if x.garment]
hats = [x for x in clothes if x.headgear]

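If you want to create the whole lists, there is no point in using generator expressions for lazy evaluation, because you are not going to be lazy.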

If you only want to work with a few things at a time, or you are memory-bound, use @vartec's generator solution.
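As a rough illustration of working with a few things at a time, the generator chain can be consumed in small batches with itertools.islice. The batches() helper, the namedtuple-based Thing, and the sample data are assumptions made up for this sketch.

```python
from collections import namedtuple
from itertools import islice

# Hypothetical stand-in for the asker's objects.
Thing = namedtuple('Thing', 'name garment headgear')

def get_clothes(things):
    return (t for t in things if t.garment)

def get_hats(things):
    return (t for t in things if t.headgear)

def batches(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Assumed sample data: every 2nd thing is a garment, every 4th is headgear.
all_things = [Thing('thing%d' % i, i % 2 == 0, i % 4 == 0) for i in range(12)]

hat_batches = [
    [hat.name for hat in batch]
    for batch in batches(get_hats(get_clothes(all_things)), 2)
]
```

Only one batch of hats is held in memory at a time inside the loop; hat_batches collects the names here purely so the result can be inspected.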

Optimizing list filtering in Python 2.7
