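I have a list of integers, and I want to write a function that returns the subset of numbers that fall within a range. Something like a NumbersWithinRange(list, interval) function...
That is,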
list = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
interval = [4,20]
results = NumbersWithinRange(list, interval) # [4,4,6,8,7,8]
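I may have left a number out of the result, but that's the idea...
The list can be as long as 10 to 20 million elements, and the range is usually a few hundred wide.
Any suggestions on how to do this efficiently in Python? I was thinking of using bisect.
Thanks.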
Answer 0 (score: 6)
I would use numpy, especially if the list is that long. For example:
In [101]: list = np.array([4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100])
In [102]: list
Out[102]:
array([ 4, 2, 1, 7, 9, 4, 3, 6, 8, 97, 7, 65, 3,
2, 2, 78, 23, 1, 3, 4, 5, 67, 8, 100])
In [103]: good = np.where((list > 4) & (list < 20))
In [104]: list[good]
Out[104]: array([7, 9, 6, 8, 7, 5, 8])
# %timeit says that numpy is MUCH faster than any list comprehension:
# create an array 10**6 random ints b/w 0 and 100
In [129]: arr = np.random.randint(0,100,1000000)
In [130]: interval = xrange(4,21)
In [126]: %timeit r = [x for x in arr if x in interval]
1 loops, best of 3: 14.2 s per loop
In [136]: %timeit good = np.where((list > 4) & (list < 20)) ; new_list = list[good]
100 loops, best of 3: 10.8 ms per loop
In [134]: %timeit r = [x for x in arr if 4 < x < 20]
1 loops, best of 3: 2.22 s per loop
In [142]: %timeit filtered = [i for i in ifilter(lambda x: 4 < x < 20, arr)]
1 loops, best of 3: 2.56 s per loop
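A minimal sketch of the same masking idea wrapped in a function, assuming Python 3 and an inclusive interval as in the question (the session above uses strict inequalities, so it drops the 4s). The name `numbers_within_range` is hypothetical. Indexing with the boolean mask directly selects the same elements as going through `np.where`.

```python
import numpy as np

def numbers_within_range(values, interval):
    """Return the elements x of values with interval[0] <= x <= interval[1]."""
    arr = np.asarray(values)
    low, high = interval
    # Element-wise comparisons produce a boolean mask; & combines them.
    mask = (arr >= low) & (arr <= high)
    return arr[mask]

data = [4, 2, 1, 7, 9, 4, 3, 6, 8, 97, 7, 65, 3, 2, 2, 78, 23, 1, 3, 4, 5, 67, 8, 100]
print(numbers_within_range(data, [4, 20]).tolist())
# [4, 7, 9, 4, 6, 8, 7, 4, 5, 8]
```

The result keeps the original order of the input, and no sorting is needed.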
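Answer 1 (score: 5)
The pure-Python sortedcontainers module has a SortedList type that can help you. It maintains the list in sorted order automatically and has been tested with tens of millions of elements. The sorted list type has bisect methods you can use.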
from sortedcontainers import SortedList
data = SortedList(...)
def NumbersWithinRange(items, lower, upper):
    # bisect_left, so that values equal to lower are included
    start = items.bisect_left(lower)
    end = items.bisect_right(upper)
    return items[start:end]
subset = NumbersWithinRange(data, 4, 20)
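Bisecting and indexing this way is much faster than scanning the entire list. The sortedcontainers module is very fast and has a performance comparison page with benchmarks against other implementations.
Answer 2 (score: 3)
If the list is not sorted, you need to scan the entire list: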
lst = [ 4,2,1,...]
interval=[4,20]
results = [ x for x in lst if interval[0] <= x <= interval[1] ]
If the list is sorted, you can use bisect to find the left and right indices that bound your range.
import bisect
left = bisect.bisect_left(lst, interval[0])
right = bisect.bisect_right(lst, interval[1])
# bisect_left already points at the first element >= interval[0]
results = lst[left:right]
Since scanning the list is O(n) and sorting is O(n lg n), it is probably not worth sorting the list just to use bisect, unless you plan on doing many range extractions.
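The sort-once, query-many case can be sketched as follows, assuming Python 3 and an inclusive interval; the function name `numbers_within_range` is hypothetical. The sort costs O(n lg n) once, and each query after that is two O(lg n) bisections plus the cost of copying the slice.

```python
import bisect

def numbers_within_range(sorted_values, interval):
    """Return all x in sorted_values with interval[0] <= x <= interval[1]."""
    low, high = interval
    left = bisect.bisect_left(sorted_values, low)     # first index with value >= low
    right = bisect.bisect_right(sorted_values, high)  # first index with value > high
    return sorted_values[left:right]

data = [4, 2, 1, 7, 9, 4, 3, 6, 8, 97, 7, 65, 3, 2, 2, 78, 23, 1, 3, 4, 5, 67, 8, 100]
sorted_data = sorted(data)  # paid once, reused for every query

print(numbers_within_range(sorted_data, [4, 20]))
# [4, 4, 4, 5, 6, 7, 7, 8, 8, 9]
print(numbers_within_range(sorted_data, [60, 80]))
# [65, 67, 78]
```

Note that the result comes back in sorted order, not in the order of the original list.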
Answer 3 (score: 2)
I think this should be efficient enough:
>>> nums = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
>>> r = [x for x in nums if 4 <= x <21]
>>> r
[4, 7, 9, 4, 6, 8, 7, 4, 5, 8]
Edit:
Modified the code after J.F. Sebastian's excellent observation.
Answer 4 (score: 1)
Using an iterator:
>>> from itertools import ifilter
>>> A = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
>>> [i for i in ifilter(lambda x: 4 < x < 20, A)]
[7, 9, 6, 8, 7, 5, 8]
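`itertools.ifilter` exists only in Python 2. In Python 3 the built-in `filter` already returns a lazy iterator, so a rough equivalent of the snippet above (same strict bounds, variable names chosen for illustration) would be:

```python
nums = [4, 2, 1, 7, 9, 4, 3, 6, 8, 97, 7, 65, 3, 2, 2, 78, 23, 1, 3, 4, 5, 67, 8, 100]

# filter() yields matching items one at a time; nothing is computed yet.
in_range = filter(lambda x: 4 < x < 20, nums)

# Materialize the iterator only when a list is actually needed.
result = list(in_range)
print(result)
# [7, 9, 6, 8, 7, 5, 8]
```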
Answer 5 (score: 1)
Let's create a list similar to the one you describe:
import random
l = [random.randint(-100000,100000) for i in xrange(1000000)]
Now test some possible solutions:
from itertools import ifilter
import cmpthese  # benchmarking helper used below, assumed to be installed
interval=range(400,800)
def v2():
    """ return a list """
    return [i for i in l if i in interval]
def v3():
    """ build the list from a generator expression """
    return list((i for i in l if i in interval))
def v4():
    def te(x):
        return x in interval
    return filter(te,l)
def v5():
    return [i for i in ifilter(lambda x: x in interval, l)]
print len(v2()),len(v3()), len(v4()), len(v5())
cmpthese.cmpthese([v2,v3,v4,v5],micro=True, c=2)
This prints:
      rate/sec    usec/pass     v5     v4     v2     v3
v5           0  6929225.922     --  -0.4%  -1.0%  -1.6%
v4           0  6903028.488   0.4%     --  -0.6%  -1.2%
v2           0  6861472.487   1.0%   0.6%     --  -0.6%
v3           0  6817855.477   1.6%   1.2%   0.6%     --
However, notice what happens if interval is a set instead of a list:
interval=set(range(400,800))
cmpthese.cmpthese([v2,v3,v4,v5],micro=True, c=2)
      rate/sec   usec/pass      v5      v4      v3      v2
v5           5  201332.569      --  -20.6%  -62.9%  -64.6%
v4           6  159871.578   25.9%      --  -53.2%  -55.4%
v3          13   74769.974  169.3%  113.8%      --   -4.7%
v2          14   71270.943  182.5%  124.3%    4.9%      --
Now compare with numpy:
import numpy as np
na=np.array(l)
def v7():
    """ assume you have to convert from list => numpy array and return a list """
    arr=np.array(l)
    tgt = np.where((arr >= 400) & (arr < 800))
    return [arr[x] for x in tgt][0].tolist()
def v8():
    """ start with a numpy array but return a python list """
    tgt = np.where((na >= 400) & (na < 800))
    return na[tgt].tolist()
def v9():
    """ numpy all the way through """
    tgt = np.where((na >= 400) & (na < 800))
    return [na[x] for x in tgt][0]
    # or return na[tgt] if you prefer that syntax...
cmpthese.cmpthese([v2,v3,v4,v5, v7, v8,v9],micro=True, c=2)
      rate/sec   usec/pass       v5       v4       v7      v3      v2      v8      v9
v5           5  185431.957       --   -17.4%   -24.7%  -63.3%  -63.4%  -93.6%  -93.6%
v4           7  153095.007    21.1%       --    -8.8%  -55.6%  -55.7%  -92.3%  -92.3%
v7           7  139570.475    32.9%     9.7%       --  -51.3%  -51.4%  -91.5%  -91.5%
v3          15   67983.985   172.8%   125.2%   105.3%      --   -0.2%  -82.6%  -82.6%
v2          15   67861.438   173.3%   125.6%   105.7%    0.2%      --  -82.5%  -82.5%
v8          84   11850.476  1464.8%  1191.9%  1077.8%  473.7%  472.6%      --   -0.0%
v9          84   11847.973  1465.1%  1192.2%  1078.0%  473.8%  472.8%    0.0%      --
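Clearly numpy is faster than pure Python, as long as you can stay in numpy all the way through. Otherwise, use a set for the interval to speed things up...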
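The gap between the list and set runs comes from the membership test: `x in some_list` scans the list element by element, while `x in some_set` is a hash lookup. The benchmark above is Python 2, where `range(...)` builds a list. A small Python 3 sketch of the three equivalent filters, with variable names chosen for illustration and no timing claims:

```python
import random

random.seed(0)
values = [random.randint(-100000, 100000) for _ in range(100000)]

as_list = list(range(400, 800))  # "x in as_list" scans up to 400 items per test
as_set = set(as_list)            # "x in as_set" is a hash lookup

by_set = [x for x in values if x in as_set]
by_compare = [x for x in values if 400 <= x < 800]  # no container needed at all
by_list = [x for x in values[:2000] if x in as_list]  # slow path, small slice only

# All three approaches select the same elements.
assert by_set == by_compare
assert by_list == [x for x in values[:2000] if 400 <= x < 800]
```

For a contiguous integer interval the chained comparison avoids building any container. In Python 3, a `range` object also tests integer membership in constant time, so the list-versus-set difference applies mainly to Python 2 or to explicit lists.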
Answer 6 (score: 0)
I think you are looking for something like this...
b=[i for i in a if 4<=i<90]
print sorted(set(b))
[4, 5, 6, 7, 8, 9, 23, 65, 67, 78]
Answer 7 (score: 0)
If the dataset is not too sparse, you can use "bins" to store and retrieve the data. For example:
a = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
# Initialize a list of 0's [0, 0, ...]
# This is assuming that the minimum possible value is 0
bins = [0 for _ in range(max(a) + 1)]
# Update the bins with the frequency of each number
for i in a:
    bins[i] += 1
def NumbersWithinRange(data, interval):
    result = []
    for i in range(interval[0], interval[1] + 1):
        freq = data[i]
        if freq > 0:
            result += [i] * freq
    return result
This works for this test case:
print(NumbersWithinRange(bins, [4, 20]))
# [4, 4, 4, 5, 6, 7, 7, 8, 8, 9]
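For simplicity, I left some bounds checking out of the function.
To reiterate, this may do better in terms of space and time usage, but it depends heavily on your particular dataset. The less sparse the dataset, the better it works.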
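One way the omitted bounds checks might look, as a sketch assuming Python 3 and non-negative integers; the helper names `build_bins` and `numbers_within_range` are hypothetical. The interval is clamped to the indices that actually exist in the bins, so a range that extends past the largest value or below zero does not raise an IndexError or wrap around.

```python
def build_bins(values):
    """Count occurrences of each value; this sketch assumes non-negative integers."""
    if not values:
        return []
    if min(values) < 0:
        raise ValueError("this sketch assumes non-negative integers")
    bins = [0] * (max(values) + 1)
    for v in values:
        bins[v] += 1
    return bins

def numbers_within_range(bins, interval):
    """Return the stored values in [interval[0], interval[1]], in ascending order."""
    low = max(interval[0], 0)               # clamp below at the first bin
    high = min(interval[1], len(bins) - 1)  # clamp above at the last bin
    result = []
    for value in range(low, high + 1):
        result.extend([value] * bins[value])
    return result

a = [4, 2, 1, 7, 9, 4, 3, 6, 8, 97, 7, 65, 3, 2, 2, 78, 23, 1, 3, 4, 5, 67, 8, 100]
bins = build_bins(a)
print(numbers_within_range(bins, [4, 20]))    # [4, 4, 4, 5, 6, 7, 7, 8, 8, 9]
print(numbers_within_range(bins, [90, 500]))  # [97, 100]
print(numbers_within_range(bins, [-5, 2]))    # [1, 1, 2, 2, 2]
```

Each query costs time proportional to the width of the interval plus the number of results, independent of the list length, which fits the stated case of a long list and a range a few hundred wide.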