NumPy arrays: filling and extracting data quickly

Time: 2011-04-05 23:40:19

Tags: python arrays performance numpy loading

3 answers:

Answer 0 (score: 5)

You describe your data as "lists of lists of lists of coordinates". From this I'm guessing your extraction looks like this:

for x in points:
    for y in x:
        for z in y:
            # z is a tuple with GPS coordinates
            pass

Do this instead:

import itertools
import numpy

# initially, points is a list of lists of lists
points = itertools.chain.from_iterable(points)
# now points is an iterable producing lists
points = itertools.chain.from_iterable(points)
# now points is an iterable producing coordinates
points = itertools.chain.from_iterable(points)
# now points is an iterable producing individual floating point values
data = numpy.fromiter(points, float)
# data is a 1-D numpy array containing all the coordinates
# (integer division: reshape needs an int, and "/" yields a float on Python 3)
data = data.reshape(data.size // 2, 2)
# data has now been reshaped to be an nx2 array

Both itertools and numpy.fromiter are implemented in C and are very efficient, so this conversion should run quickly.
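As a minimal, self-contained sketch of the flattening above (the sample `points` data is made up for illustration, and Python 3 is assumed):

```python
import itertools
import numpy

# Hypothetical sample data: objects -> polygons -> (lat, lon) tuples.
points = [
    [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0)]],
    [[(7.0, 8.0), (9.0, 10.0), (11.0, 12.0)]],
]

flat = points
# Each pass of chain.from_iterable removes one level of nesting.
for _ in range(3):
    flat = itertools.chain.from_iterable(flat)

data = numpy.fromiter(flat, dtype=float)
data = data.reshape(-1, 2)  # one row per coordinate pair
```

Passing -1 to reshape lets numpy work out the number of rows itself, which avoids the explicit division.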

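The second part of your question doesn't really say what you want to do with the data. Indexing a numpy array element by element is slower than indexing a Python list. The speed gain comes from performing operations on the data in bulk. Without knowing more about what you are doing with the data, it is hard to suggest how to fix it.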
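To illustrate the difference between per-element indexing and a bulk operation, here is a small sketch on made-up data; both variants give the same result, but only the second one hands the whole column to numpy in a single call:

```python
import numpy

data = numpy.arange(10, dtype=float).reshape(5, 2)

# Element by element: every access goes through Python-level indexing.
looped = data.copy()
for i in range(looped.shape[0]):
    looped[i, 0] = looped[i, 0] * 0.5

# Bulk: one vectorized operation over the whole first column.
bulk = data.copy()
bulk[:, 0] *= 0.5
```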

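Update:

I've gone ahead and done everything using itertools and numpy. I take no responsibility for any brain damage caused by attempting to understand this code.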

import itertools
import numpy

# GetMyPoints and draw_polygon are assumed to be defined elsewhere (as in the question)

# firstly, we use map to call GetMyPoints a bunch of times
# (Python 3: map, zip and range are lazy; on Python 2 use itertools.imap, izip and xrange)
objects = map(GetMyPoints, range(100))
# next, we use itertools.chain to flatten it into all of the polygons
polygons = itertools.chain.from_iterable(objects)
# tee gives us two iterators over the polygons
polygons_a, polygons_b = itertools.tee(polygons)
# the lengths will be the length of each polygon
polygon_lengths = map(len, polygons_a)
# for the actual points, we'll flatten the polygons into points
points = itertools.chain.from_iterable(polygons_b)
# then we'll flatten the points into values
values = itertools.chain.from_iterable(points)

# package all of that into a numpy array
all_points = numpy.fromiter(values, float)
# reshape the numpy array so we have two values for each coordinate
all_points = all_points.reshape(all_points.size // 2, 2)

# produce an iterator of lengths, but put a zero in front
polygon_positions = itertools.chain([0], polygon_lengths)
# produce another numpy array from this
# however, we take the cumulative sum
# so that each index will be the starting index of a polygon
polygon_positions = numpy.cumsum(numpy.fromiter(polygon_positions, int))

# now for the transformation
# multiply the first coordinate of every point by 0.5
all_points[:, 0] *= 0.5

# now to get it out

# polygon_positions is all of the starting positions
# polygon_positions[1:] is the same, but shifted one forward,
# thus it gives us the end of each slice
# starmap(slice, ...) turns each (start, end) pair into a slice object
slices = itertools.starmap(slice, zip(polygon_positions, polygon_positions[1:]))
# polygons produces an iterator which uses the slices to fetch
# each polygon
polygons = map(all_points.__getitem__, slices)

# just iterate over the polygons normally
# each one will be a slice of the numpy array
for polygon in polygons:
    draw_polygon(polygon)

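You may find it best to process one polygon at a time: convert each polygon into a numpy array and perform vector operations on it. That alone will probably give you a significant speed advantage. Putting all of your data into numpy can be a bit difficult.

This is harder than most numpy work because of your oddly shaped data. Numpy pretty much assumes a world of uniformly shaped data.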
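A minimal sketch of the one-polygon-at-a-time approach, using made-up polygons of different lengths (the data and the variable names are assumptions for illustration):

```python
import numpy

# Hypothetical polygons with different numbers of points.
polygons = [
    [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)],
    [(4.0, 4.0), (6.0, 4.0), (6.0, 6.0), (4.0, 6.0)],
]

transformed = []
for polygon in polygons:
    arr = numpy.array(polygon, dtype=float)  # shape (n_points, 2)
    arr[:, 0] *= 0.5  # vectorized over all points of this polygon
    transformed.append(arr)
```

Each polygon is a regular n x 2 array on its own, so the ragged outer structure can stay a plain Python list.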

Answer 1 (score: 2)

This will be faster:

numpy.array(point_buffer, dtype=numpy.float32)

Modify the array, not the list. If possible, it is even better to avoid creating the list in the first place.
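One possible way to skip the intermediate list is to feed numpy directly from a generator. The sketch below assumes a hypothetical data source `read_points` that yields coordinate pairs one at a time; it is not part of the original question:

```python
import itertools
import numpy

def read_points(n):
    """Hypothetical data source yielding (lat, lon) pairs one at a time."""
    for i in range(n):
        yield (float(i), float(i) + 0.5)

n = 4
flat = itertools.chain.from_iterable(read_points(n))
# count is optional; when known, it lets fromiter allocate the array once.
point_buffer = numpy.fromiter(flat, dtype=numpy.float32, count=2 * n)
point_buffer = point_buffer.reshape(n, 2)
```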

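Edit 1: profiling

Below is some test code showing how efficiently numpy converts lists to arrays (it is pretty good). My list-to-buffer idea is only on par with what numpy does, not better.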

import timeit

setup = '''
import numpy
import itertools
import struct
big_list = numpy.random.random((10000,2)).tolist()'''

old_way = '''
a = numpy.empty(( len(big_list), 2), numpy.float32)
for i,e in enumerate(big_list):
    a[i] = e
'''

normal_way = '''
a = numpy.array(big_list, dtype=numpy.float32)
'''

iter_way = '''
chain = itertools.chain.from_iterable(big_list)
a = numpy.fromiter(chain, dtype=numpy.float32)
'''

my_way = '''
chain = itertools.chain.from_iterable(big_list)
buffer = struct.pack('f'*len(big_list)*2,*chain)
a = numpy.frombuffer(buffer, numpy.float32)
'''

for way in [old_way, normal_way, iter_way, my_way]:
    print(timeit.Timer(way, setup).timeit(1))

Results:

0.22445492374
0.00450378469941
0.00523579114088
0.00451488946237

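Edit 2: on the hierarchical nature of the data

If I understand correctly that the data is always a list of lists of lists (objects, polygons, coordinates), then this is the approach I'd take: reduce the data to the lowest dimension that still gives a rectangular array (2D in this case) and use a separate array to track the indices of the higher-level branches. This is essentially an implementation of Winston's idea of using numpy.fromiter on an itertools chain object. The only added idea is the branch indexing.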

import numpy, itertools

# hierarchical list of lists of coord pairs
polys = [numpy.random.random((n,2)).tolist() for n in [5,7,12,6]]

# get the indices of the polygons:
lengs = numpy.array([0]+[len(l) for l in polys])
p_idxs = numpy.add.accumulate(lengs)

# convert the flattened list to an array:
chain = itertools.chain.from_iterable
a = numpy.fromiter(chain(chain(polys)), dtype=numpy.float32).reshape(lengs.sum(), 2)

# transform the coords
a *= .5

# get a transformed polygon (using the indices)
def get_poly(n):
    i0 = p_idxs[n]
    i1 = p_idxs[n+1]
    return a[i0:i1]

print('poly2', get_poly(2))
print('poly0', get_poly(0))
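If all polygons are needed at once rather than one by index, numpy.split can cut the flat array at the same cumulative offsets. This is a sketch on made-up data, offered as an alternative to the get_poly helper rather than as part of the original answer:

```python
import numpy

# Hypothetical flattened points for polygons with 2, 3 and 1 points.
lengths = numpy.array([2, 3, 1])
a = numpy.arange(12, dtype=numpy.float32).reshape(6, 2)

# Split points are the cumulative lengths, excluding the final total.
split_at = numpy.cumsum(lengths)[:-1]
polys = numpy.split(a, split_at)
```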

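Answer 2 (score: 2)

The point of using numpy arrays is to avoid loops as much as possible. Writing the loops yourself results in slow code, whereas numpy arrays let you use predefined vectorized functions, which are much faster (and easier!).

So, to convert a list to an array, you can use: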

point_buffer = np.array(point_list)

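If the list contains elements such as (lat, lon), it will be converted into an array with two columns.

With that numpy array you can easily operate on all elements at once. For example, to multiply the first element of each coordinate pair by 0.5 as in your question, you can simply do the following (assuming the first element is in the first column):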

point_buffer[:, 0] *= 0.5  # in place; a plain "*" would return a new array instead
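Note that a plain `*` expression returns a new array and leaves the original untouched, while `*=` updates the array in place. A short sketch with made-up (lat, lon) pairs:

```python
import numpy as np

# Hypothetical (lat, lon) pairs.
point_list = [(10.0, 1.0), (20.0, 2.0), (30.0, 3.0)]
point_buffer = np.array(point_list)

halved = point_buffer[:, 0] * 0.5    # new array; point_buffer is unchanged
unchanged_first = point_buffer[0, 0]

point_buffer[:, 0] *= 0.5            # modifies point_buffer in place
```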