Question

I have a list of tuples as shown below (the list is sorted in descending order of the second value):

from string import ascii_letters
myTup = list(zip(ascii_letters, range(10)[::-1]))  # list() so this is a real list on Python 3 as well
threshold = 5.5

>>> myTup
[('a', 9), ('b', 8), ('c', 7), ('d', 6), ('e', 5), ('f', 4), ('g', 3), ('h', 2), \
('i', 1), ('j', 0)]

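Given a threshold, what is the best way to discard all tuples whose second value is less than this threshold?

I have more than 5 million tuples, so I would rather not compare them tuple by tuple and then delete each one or append it to another list of tuples.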

Answer 1

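Since the tuples are sorted, you can simply search for the first tuple whose value is below the threshold, then delete the remaining values using slice notation: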

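# Default to len(myTup) so that next() does not raise StopIteration when no value is below the threshold
index = next((i for i, (t1, t2) in enumerate(myTup) if t2 < threshold), len(myTup))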
del myTup[index:]

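As Vaughn Cato points out, a binary search would speed things up. bisect.bisect would be useful, except that it won't work with your current data structure unless you create a separate key sequence, as the documentation notes. But that violates your prohibition on creating new lists.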

You could, however, use the source code as the basis for your own binary search. Or, you could change your data structure:

>>> myTup
[(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e'), (5, 'f'), 
 (6, 'g'), (7, 'h'), (8, 'i'), (9, 'j')]
>>> index = bisect.bisect(myTup, (threshold, None))
>>> del myTup[:index]
>>> myTup
[(6, 'g'), (7, 'h'), (8, 'i'), (9, 'j')]

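The disadvantage here is that the deletion may take linear time, since Python has to shift the entire block of memory back... unless Python is smart about deleting slices that start at 0. (Does anyone know?)

Finally, if you're really willing to change your data structure, you could do this: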

>>> myTup
[(-9, 'a'), (-8, 'b'), (-7, 'c'), (-6, 'd'), (-5, 'e'), (-4, 'f'), 
 (-3, 'g'), (-2, 'h'), (-1, 'i'), (0, 'j')]
>>> index = bisect.bisect(myTup, (-threshold, None))
>>> del myTup[index:]
>>> myTup
[(-9, 'a'), (-8, 'b'), (-7, 'c'), (-6, 'd')]

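(Note that Python 3 raises a TypeError if the None ends up being compared with a string, which happens when the first elements tie, so you could use something like (-threshold, chr(0)) instead.)

I suspect that the linear-time search I suggested at the beginning is acceptable in most cases.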
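For reference, the hand-written binary search mentioned above could look roughly like the following. This is a sketch rather than the bisect module's actual source: the function name first_below is made up for illustration, and it assumes the list is sorted by second value in descending order, as in the question.

```python
def first_below(pairs, threshold):
    """Return the index of the first pair whose second value is < threshold.

    Assumes `pairs` is sorted by second value in descending order.
    Returns len(pairs) if no value is below the threshold.
    """
    lo, hi = 0, len(pairs)
    while lo < hi:
        mid = (lo + hi) // 2
        if pairs[mid][1] >= threshold:
            lo = mid + 1  # pairs[mid] is kept, so the cut is to the right
        else:
            hi = mid      # pairs[mid] is dropped, so the cut is here or to the left
    return lo


my_tup = [('a', 9), ('b', 8), ('c', 7), ('d', 6), ('e', 5),
          ('f', 4), ('g', 3), ('h', 2), ('i', 1), ('j', 0)]

# O(log n) comparisons to find the cut, then trim the tail in place.
del my_tup[first_below(my_tup, 5.5):]
print(my_tup)
```

Because the cut is at the tail of the original descending list, nothing has to be shifted after the deletion and the data structure does not need to change.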

Answer 2

Here is an exotic approach that wraps the list in a list-like object before performing the bisect.

import bisect

def revkey(items):
    class Items:
        def __getitem__(self, index):
            assert 0 <= index < _len
            return items[_max-index][1]
        def __len__(self):
            return _len
        def bisect(self, value):
            return _len - bisect.bisect_left(self, value)
    _len = len(items)
    _max = _len-1
    return Items()

tuples = [('a', 9), ('b', 8), ('c', 7), ('d', 6), ('e', 5), ('f', 4), ('g', 3), ('h', 2), ('i', 1), ('j', 0)]

for x in range(-2, 12):
    assert len(tuples) == 10
    t = tuples[:]
    stop = revkey(t).bisect(x)
    del t[stop:]
    assert t == [item for item in tuples if item[1] >= x]

Answer 3

This code may be a bit faster than @Curious's:

newTup=[]
for tup in myTup:
    if tup[1] >= threshold:  # only values strictly below the threshold are discarded
        newTup.append(tup)
    else:
        break

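Because the tuples are ordered, you don't need to go through all of them.

Another possibility is to use bisection to find the index i of the last element that is not below the threshold. Then you would do:

newTup = myTup[:i + 1]  # i + 1 because the slice end is exclusive

I think this last method is the fastest.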

Answer 4

Considering the number of tuples you are dealing with, you may want to consider using NumPy.

Define a structured array like

import numpy as np
my_array = np.array(myTup, dtype=[('f0', "|S10"), ('f1', float)])

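You can access the second element of the tuples with my_array['f1'], which gives you a float array. You can then use fancy indexing (here, a boolean mask) to filter the elements you want, e.g.

my_array[my_array['f1'] < threshold]

to keep only the entries where f1 is less than threshold. (Use >= instead to keep the entries at or above the threshold, which is what the question asks for.)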

Answer 5

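You can also use itertools, e.g.

from itertools import ifilter  # Python 2 only; on Python 3 the built-in filter is already lazy
iterable_filtered = ifilter(lambda x : x[1] > threshold, myTup)

if you want a filtered iterable, or just:

filtered = filter(lambda x: x[1] > threshold, myTup)

to get a list directly (on Python 3, wrap the call in list(...), since filter returns an iterator there).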
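Since the list is already sorted, itertools.takewhile may be a better fit than a filter: a filter tests every element, while takewhile stops at the first element that fails the test. A minimal sketch, with my_tup and kept as illustrative names; note that it does build a new list, which the question wanted to avoid.

```python
from itertools import takewhile

my_tup = [('a', 9), ('b', 8), ('c', 7), ('d', 6), ('e', 5),
          ('f', 4), ('g', 3), ('h', 2), ('i', 1), ('j', 0)]
threshold = 5.5

# takewhile yields items until the predicate first returns False,
# so it never looks past the first tuple below the threshold.
kept = list(takewhile(lambda t: t[1] >= threshold, my_tup))
print(kept)
```

This only gives the right answer for data sorted in descending order of the second value; on unsorted input it would silently drop later matches.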

I'm not very familiar with the relative performance of these methods; you would have to test them (e.g. in IPython using %timeit).
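Outside IPython, the standard-library timeit module can do the same job. The sketch below compares a linear scan with a hand-written binary search on synthetic data; the function names, the data size and the repeat count are arbitrary choices for illustration, and the timings will vary by machine, so none are shown here.

```python
import timeit


def linear_index(pairs, threshold):
    # Index of the first pair below the threshold, or len(pairs) if none.
    return next((i for i, (_, v) in enumerate(pairs) if v < threshold), len(pairs))


def binary_index(pairs, threshold):
    # Same result via binary search; assumes descending second values.
    lo, hi = 0, len(pairs)
    while lo < hi:
        mid = (lo + hi) // 2
        if pairs[mid][1] >= threshold:
            lo = mid + 1
        else:
            hi = mid
    return lo


n = 100_000
data = [(str(i), n - i) for i in range(n)]  # second values n, n-1, ..., 1
threshold = n / 2

# Both approaches must agree before their timings are worth comparing.
assert linear_index(data, threshold) == binary_index(data, threshold)

for fn in (linear_index, binary_index):
    seconds = timeit.timeit(lambda: fn(data, threshold), number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 calls")
```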
