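I'm trying to extract slices of data (selected by value) from very large arrays (len > 1000000). The following Python code shows an example of what I'm trying to do in pure Python: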
vector = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
start = [1, 4, 9]  # start and end lists have the same length
end = [2, 7, 9]
output = [[]] * len(start)  # placeholders; each slot is reassigned below
for indx1 in range(len(start)):
    temp = []
    for indx2 in range(len(vector)):
        # keep values inside the closed interval [start, end]
        if (vector[indx2] >= start[indx1]) and (vector[indx2] <= end[indx1]):
            temp.append(vector[indx2])
    output[indx1] = temp
print(output)
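The vector list usually has about 25E+6 elements, while the start and end lists have about 1E6 elements each, which is why doing this in pure Python is very slow: the nested loops visit every element of the vector once per (start, end) pair.
Do you know how to use numpy to avoid the for loops for this problem?
Thanks for your time.
Answer 0 (score: 1)
If the vector is sorted, this should be very fast: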
import numpy as np
vector = np.array([2.0, 2.24, 3.1, 4.768, 16.8, 16.9, 23.5, 24.0])
start = np.array([3.0, 4.5, 6.5, 15.2])
end = np.array([7.3, 16.2, 17.7, 25.8])
# index of the first element >= each start value
start_i = vector.searchsorted(start, 'left')
# index just past the last element <= each end value
end_i = vector.searchsorted(end, 'right')
output = [vector[s:e] for s, e in zip(start_i, end_i)]
print(output)
[array([ 3.1 , 4.768]), array([ 4.768]), array([ 16.8, 16.9]), array([ 16.8, 16.9, 23.5, 24. ])]
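If the vector is not already sorted, one option is to sort a copy first and then apply the same two searchsorted calls. The following is only a rough sketch under that assumption; the function name slices_by_value is made up for illustration, and note that the values in each slice come back in sorted order, not in their original order.

```python
import numpy as np

def slices_by_value(vector, start, end):
    """For each (start[i], end[i]) pair, return the values v with start[i] <= v <= end[i].

    Hypothetical helper: sorts a copy of the input, so the returned
    slices are in ascending order rather than the original order.
    """
    sorted_vector = np.sort(np.asarray(vector))  # np.sort returns a sorted copy
    # first position whose value is >= each start
    start_i = np.searchsorted(sorted_vector, start, side="left")
    # one past the last position whose value is <= each end
    end_i = np.searchsorted(sorted_vector, end, side="right")
    return [sorted_vector[s:e] for s, e in zip(start_i, end_i)]

# The question's data, shuffled to show that the input order does not matter.
result = slices_by_value([10, 3, 7, 1, 9, 2, 5, 8, 4, 6], [1, 4, 9], [2, 7, 9])
print([r.tolist() for r in result])
```

The sort costs O(N log N) once, after which each interval needs only two binary searches, instead of a full pass over the vector for every interval.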
You can also do something similar in pure Python; it is not as fast, but it does not require numpy:
from bisect import bisect_left, bisect_right
vector = [2.0, 2.24, 3.1, 4.768, 16.8, 16.9, 23.5, 24.0]
start = [3.0, 4.5, 6.5, 15.2]
end = [7.3, 16.2, 17.7, 25.8]
se = zip(start, end)
# vector must be sorted for the binary searches to be valid
output = [vector[bisect_left(vector, s):bisect_right(vector, e)] for s, e in se]
print(output)
[[3.1, 4.768], [4.768], [16.8, 16.9], [16.8, 16.9, 23.5, 24.0]]
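As a quick sanity check, the bisect version can be compared against the nested-loop logic from the question on the question's own data. This is just a sketch; the function names slices_bruteforce and slices_bisect are invented here, and the bisect variant assumes the vector is sorted in ascending order.

```python
from bisect import bisect_left, bisect_right

def slices_bruteforce(vector, start, end):
    # Same logic as the question's nested loops: scan the whole vector per interval.
    return [[v for v in vector if s <= v <= e] for s, e in zip(start, end)]

def slices_bisect(vector, start, end):
    # Assumes vector is sorted ascending; two binary searches per interval.
    return [vector[bisect_left(vector, s):bisect_right(vector, e)]
            for s, e in zip(start, end)]

vector = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
start = [1, 4, 9]
end = [2, 7, 9]

expected = slices_bruteforce(vector, start, end)
fast = slices_bisect(vector, start, end)
print(fast == expected)
```

On a sorted vector both give the same slices, but the bisect version does O(log N) work per interval to find the bounds, where the loop version does O(N).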