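Python while loop to break up a sorted list/array

Time: 2013-11-03 02:10:45

Tags: python numpy

I have a sorted list, actually a huge array of (x, y, z) triples sorted by x. My goal is to break it into pieces according to ranges of x. I've been trying: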

for triple in hugelist:  
    while triple[0] >= minx and triple[0] < maxx:  
        #do some stuff  
    # when out of that range, increase endpoints to the next range  
    minx = minx + deltax  
    maxx = maxx + deltax  
    # do some other stuff  
    # and hopefully move to next triple  

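This doesn't work, of course, because I'm misusing while, and I understand why. But I can't figure out how to step through the list. hugelist holds about 2 million triples, to be split into roughly 600 chunks. If possible, I'd like to pass through it only once, in order.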

==============================

With Tim's help, and using a mini-list of 291 points, bisect misses the place where maxx should go:

while xstart < len(heights):   
    xfinish = bisect.bisect_left(heights, (maxx, 0, 0), lo=xstart)    
    xslice = heights[xstart:xfinish]  
    print "xstart is ", xstart, " xfinish is ", xfinish  
    print "maxx is ", maxx, " xslice is ", xslice  

    maxx += deltax   
    xstart = xfinish  


xstart is  0  xfinish is  291  
maxx is  804.0  xslice is  [(803.01, 1941.84, 0.74) (803.04, 1941.88, 0.45) (803.06, 1941.25, 0.0)
 (803.07, 1941.01, 0.0) (803.07, 1941.52, 0.31) (803.09, 1941.16, 0.08)
 (803.12, 1940.05, 0.0) (803.13, 1939.72, 0.3) (803.13, 1939.86, 0.11)
 (803.13, 1940.29, 0.17)  . . .  (803.23, 1938.24, 0.2)
 (803.23, 1938.25, 0.45) (803.23, 1938.29, 0.1) (803.23, 1938.36, 0.0)
 (803.23, 1938.49, 0.0) (803.96, 1941.06, 4.21) (**803.98**, 1940.6, 4.55)
 (**804.0**, 1940.32, 4.49) (**804.01**, 1940.68, 4.6) . . .  (806.11, 1934.82, 10.64)
 (806.11, 1934.86, 10.65) (806.11, 1934.91, 10.56) (806.32, 1933.24, 4.69)]

4 Answers:

Answer 0 (score: 2)

Here is a different, more efficient approach that exploits the fact that the list is sorted:

from bisect import bisect_left

istart = 0
while istart < len(hugelist):
    ifinish = bisect_left(hugelist, (maxx, 0, 0), lo=istart)
    # Now work on the slice hugelist[istart:ifinish].
    # It's possible that istart == ifinish, i.e. that the
    # slice is empty!
    maxx += deltax
    istart = ifinish

Using binary search reduces the number of comparisons needed.

EDIT: from the comments:

  

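It becomes very clear if you think of list indices as pointing between elements, with 0 to the "left" of the leftmost element and len(hugelist) to the "right" of the rightmost element. Then bisect_left() returns the position immediately before the first triple whose first element is >= maxx.

An example should really help: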

from bisect import bisect_left

hugelist = [(0,0,0), (1,0,0), (3,0,0), (4,1,1), (4,2,2), (5,0,0)]
maxx = 0
deltax = 1
istart = 0
while istart < len(hugelist):
    ifinish = bisect_left(hugelist, (maxx, 0, 0), lo=istart)
    # Now work on the slice hugelist[istart:ifinish].
    # It's possible that istart == ifinish, i.e. that the
    # slice is empty!
    print("for maxx = {} {}".format(maxx, hugelist[istart:ifinish]))
    maxx += deltax
    istart = ifinish

Output:

for maxx = 0 []
for maxx = 1 [(0, 0, 0)]
for maxx = 2 [(1, 0, 0)]
for maxx = 3 []
for maxx = 4 [(3, 0, 0)]
for maxx = 5 [(4, 1, 1), (4, 2, 2)]
for maxx = 6 [(5, 0, 0)]

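This mostly shows the end cases, which are what any sane reader would worry about ;-)

Answer 1 (score: 1)

You only need an if to check whether triple[0] is in the desired range. No inner loop is needed. If the list is sorted by x, there is no need to compare against the minimum; just check that x is below the maximum.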
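The loop above can also be wrapped in a generator so that each slice is handed to the caller. This is only a sketch: the name `chunks_by_x` and its parameters are made up for illustration, and it probes with `(maxx,)` instead of the answer's `(maxx, 0, 0)`. A one-element tuple sorts before every triple that starts with the same x, so the two probes differ only if y or z can be negative at exactly x == maxx.

```python
from bisect import bisect_left


def chunks_by_x(triples, first_maxx, deltax):
    """Yield (maxx, chunk) pairs from a list of tuples sorted by x.

    Each chunk holds the triples with x < maxx that were not already
    yielded for an earlier boundary. Hypothetical helper, not from the answer.
    """
    istart = 0
    maxx = first_maxx
    while istart < len(triples):
        # (maxx,) compares less than any triple whose x equals maxx, so the
        # insertion point is the index of the first triple with x >= maxx.
        ifinish = bisect_left(triples, (maxx,), lo=istart)
        yield maxx, triples[istart:ifinish]  # the slice may be empty
        maxx += deltax
        istart = ifinish


hugelist = [(0, 0, 0), (1, 0, 0), (3, 0, 0), (4, 1, 1), (4, 2, 2), (5, 0, 0)]
chunks = list(chunks_by_x(hugelist, 0, 1))
for maxx, chunk in chunks:
    print("for maxx =", maxx, chunk)
```

With the sample data this yields the same seven slices as the example above, including the two empty ones.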

for triple in hugelist:
    if triple[0] < maxx:
        pass  # do some stuff
    else:
        maxx = maxx + deltax
        # do some other stuff
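For a concrete version of this single pass, here is a minimal sketch that collects each range into its own list. The function name `split_by_x` and the sample points are hypothetical. It departs from the loop above in one respect: it advances the boundary with a small `while` rather than an `else`, so that a triple which ends one range is still placed in the next, and so that ranges containing no points come out as empty lists.

```python
def split_by_x(triples, first_maxx, deltax):
    """Split triples (sorted by x) into consecutive x ranges in one pass."""
    chunks = []
    current = []
    maxx = first_maxx
    for triple in triples:
        # Close ranges until this triple fits; a closed range may be empty.
        while triple[0] >= maxx:
            chunks.append(current)
            current = []
            maxx += deltax
        current.append(triple)
    chunks.append(current)  # flush the last, still-open range
    return chunks


points = [(0.2, 0, 1), (0.7, 0, 2), (1.1, 0, 3), (3.5, 0, 4)]
result = split_by_x(points, 1.0, 1.0)
print(result)
```

Each triple is visited once, and the inner `while` runs at most once per range overall, so the cost stays linear in the number of triples plus the number of ranges.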

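Depending on your purpose, you could also look at itertools.groupby.

EDIT: If, as you said in the comments, the goal is to get the variance of the z values within each range, you could do something like this: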

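from numpy import var  # assumption: the answer does not say where var comes from

z_variances = []
z_group = []
maxx = deltax  # assumes the x values start at 0
for x, y, z in huge_list:
    if x < maxx:
        z_group.append(z)
    else:
        # x has left the current range: record it and start the next one.
        z_variances.append(var(z_group))
        z_group = [z]
        maxx += deltax
# The loop never records the final group, so do it here.
if z_group:
    z_variances.append(var(z_group))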

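Or using groupby:

import itertools
z_variances = []
# Key on each triple's x; numpy.var (assumed) needs a list, not a generator.
for _, group in itertools.groupby(huge_list, lambda t: int(t[0] // deltax)):
    z_variances.append(var([z for x, y, z in group]))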
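To see the groupby variant run end to end, here is a small self-contained sketch. The sample data and `deltax` are invented, and `statistics.pvariance` from the standard library stands in for `var`; it computes the population variance, which is also what `numpy.var` returns by default. The results are keyed by bin number because groupby produces no group at all for a range without points.

```python
import itertools
import math
import statistics

deltax = 1.0  # assumed bin width
huge_list = [(0.1, 0, 1.0), (0.9, 0, 3.0), (1.5, 0, 2.0),
             (3.2, 0, 4.0), (3.8, 0, 8.0)]


def bin_index(triple):
    # floor (unlike int) also bins negative x values consistently.
    return math.floor(triple[0] / deltax)


z_variances = {}
for index, group in itertools.groupby(huge_list, key=bin_index):
    z_variances[index] = statistics.pvariance([z for _, _, z in group])

print(z_variances)
```

Bin 2 is absent from the result, which is the main behavioral difference from a loop that advances `maxx` step by step.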

Answer 2 (score: 1)

First, create a sample numpy array:

>>> import numpy as np
>>> alen=300000
>>> huge=np.arange(alen).reshape(alen//3,3)
>>> huge
array([[     0,      1,      2],
       [     3,      4,      5],
       [     6,      7,      8],
       ..., 
       [299991, 299992, 299993],
       [299994, 299995, 299996],
       [299997, 299998, 299999]])

This syntax gives you the first column:

>>> huge[:,0]
array([     0,      3,      6, ..., 299991, 299994, 299997])

Since you state that the array is sorted, you can use numpy.searchsorted to split the large array into buckets.

Let's split it into thirds:

>>> minx=huge[-1][0]//3
>>> maxx=huge[-1][0]*2//3
>>> minx
99999
>>> maxx
199998

Then use np.searchsorted to find the row indices at which the x values reach each boundary:

>>> np.searchsorted(huge[:,0],[minx,maxx])
array([33333, 66666])

Then slice huge into the desired buckets:

>>> buckets=np.searchsorted(huge[:,0],[minx,maxx])
>>> bucket1=huge[0:buckets[0]]
>>> bucket2=huge[buckets[0]:buckets[1]]
>>> bucket3=huge[buckets[1]:]
>>> bucket1
array([[    0,     1,     2],
       [    3,     4,     5],
       [    6,     7,     8],
       ..., 
       [99990, 99991, 99992],
       [99993, 99994, 99995],
       [99996, 99997, 99998]])
>>> bucket2
array([[ 99999, 100000, 100001],
       [100002, 100003, 100004],
       [100005, 100006, 100007],
       ..., 
       [199989, 199990, 199991],
       [199992, 199993, 199994],
       [199995, 199996, 199997]])
>>> bucket3
array([[199998, 199999, 200000],
       [200001, 200002, 200003],
       [200004, 200005, 200006],
       ..., 
       [299991, 299992, 299993],
       [299994, 299995, 299996],
       [299997, 299998, 299999]])

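You can also use np.histogram. Its first return value holds the number of rows in each bin (the second holds the bin edges, which are x values rather than row indices), so the cumulative sum of the counts gives the slice boundaries:

>>> counts=np.histogram(huge[:,0],[0,minx,maxx,huge[-1][0]])[0]
>>> edges=np.concatenate(([0],np.cumsum(counts)))
>>> b1=huge[edges[0]:edges[1]]
>>> b2=huge[edges[1]:edges[2]]
>>> b3=huge[edges[2]:edges[3]]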
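The same idea extends from three buckets to the evenly spaced ranges in the question. The following is a sketch under assumptions: the sample array and `deltax` are invented, and the boundaries are taken to be multiples of `deltax`. A single `np.searchsorted` call locates every boundary at once, and `np.split` cuts the array at those row indices.

```python
import numpy as np

# Hypothetical data: x in the first column, rows already sorted by x.
huge = np.array([[0.2, 5.0, 1.0],
                 [0.7, 5.0, 2.0],
                 [1.1, 5.0, 3.0],
                 [3.5, 5.0, 4.0],
                 [3.9, 5.0, 5.0]])
deltax = 1.0  # assumed range width

x = huge[:, 0]
first_bin = np.floor(x[0] / deltax)
last_bin = np.floor(x[-1] / deltax)
# Interior boundaries: the multiples of deltax that separate the bins
# occupied by the first and the last row.
edges = deltax * np.arange(first_bin + 1, last_bin + 1)
# side='left' places a row with x == edge in the chunk that starts at that edge.
cuts = np.searchsorted(x, edges, side='left')
chunks = np.split(huge, cuts)

for chunk in chunks:
    print(chunk[:, 0])
```

Ranges without any rows show up as empty arrays, and for a basic array like this one the pieces returned by `np.split` are views, so no row data is copied.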

Answer 3 (score: 0)

If you want "up to x", use itertools.takewhile:

import itertools

li = [(1,2,3),(4,5,6),(7,8,9),(10,11,12),(13,14,15)]

list(itertools.takewhile(lambda x: x[0] < 10,li))
Out[78]: [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
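One caveat if takewhile is tempting for cutting the whole list into consecutive pieces: it has to read the first element that fails the test in order to stop, and that element is then gone from the underlying iterator. The small sketch below reuses the list from the example to show the effect.

```python
import itertools

li = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, 15)]

it = iter(li)
first = list(itertools.takewhile(lambda t: t[0] < 10, it))
rest = list(it)  # whatever is left on the shared iterator

print(first)
print(rest)
```

Here `(10, 11, 12)` appears in neither list, so calling takewhile repeatedly on one iterator would silently drop the first triple of every range after the first.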

If you want to assign groups across the whole collection, that is what itertools.groupby is for:

def grouper(x):
    if x < 5:
        return 0
    if x < 11:
        return 1
    return 2

for i,g in itertools.groupby(li,lambda x: grouper(x[0])):
    print('group {}: {}'.format(i,list(g)))

group 0: [(1, 2, 3), (4, 5, 6)]
group 1: [(7, 8, 9), (10, 11, 12)]
group 2: [(13, 14, 15)]