Looping over numpy arrays in chunks with Python generators

Asked: 2015-05-24 13:30:44

Tags: python performance numpy generator cython

The speed bottleneck in my code is a double for-loop over the elements of two arrays, x and y. The standard HPC trick for improving performance is to do the loop in chunks so that cache misses are minimized. I am trying to use Python generators to do the chunking, but having to continually recreate the exhausted generator inside the outer for-loop is killing my runtime.

Question:

Is there a smarter algorithm for constructing the appropriate generators to perform a chunked double-for loop?

Concrete illustration:

I will create two dummy arrays, x and y. I will keep them short for illustration, but in practice these are numpy arrays with ~1e6 elements.

import numpy as np

x = np.array(['a', 'b', 'b', 'c', 'c', 'd'])
y = np.array(['e', 'f', 'f', 'g'])

The naive double loop is simply:

for xletter in x:
    for yletter in y:
        # algebraic manipulations on x & y
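The comment above elides the actual work; as a purely hypothetical stand-in, suppose each pair contributes a product to a running total. With numeric arrays the naive loop would then look like:

```python
import numpy as np

x = np.random.rand(6)
y = np.random.rand(4)

# hypothetical stand-in for the elided "algebraic manipulations":
# accumulate the sum of all pairwise products
total = 0.0
for xletter in x:
    for yletter in y:
        total += xletter * yletter

# sanity check: the sum of all pairwise products factors into a product of sums
assert np.isclose(total, x.sum() * y.sum())
```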

Now let's use generators to perform this loop in chunks:

chunk_size = 3
xchunk_gen = (x[i: i+chunk_size] for i in range(0, len(x), chunk_size))
for xchunk in xchunk_gen:
    ychunk_gen = (y[i: i+chunk_size] for i in range(0, len(y), chunk_size))
    for ychunk in ychunk_gen:
        for xletter in xchunk:
            for yletter in ychunk:
                # algebraic manipulations on x & y

Note that in order to implement a generator solution to this problem, I have to continually recreate ychunk_gen in the outer loop. Since y is a large array, this kills my runtime (for ~1e6 elements, creating this generator takes about 20 ms on my laptop).

Is there a way to construct my generators more cleverly to get around this problem? Or is it necessary to abandon the generator solution entirely?

(Note: in practice I use cython to perform this tight loop, but everything above applies regardless.)

3 Answers:

Answer 0 (score: 3)

It seems to me that the creation of your generator expression is what is killing your running time, because it is not optimized by cython.

A better solution, which keeps all the cache optimizations, is to use numexpr. Since the operations on x and y are algebraic, it should fit your constraints very well (numexpr can do a bit more than that).
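A minimal sketch of what that could look like, assuming numexpr is installed and using a hypothetical pairwise product in place of the real algebra. numexpr evaluates the whole expression internally in cache-sized blocks, which is exactly the chunking the question is trying to do by hand:

```python
import numpy as np
import numexpr as ne

x = np.random.rand(300)
y = np.random.rand(300)

# hypothetical stand-in for the elided algebra: broadcasting x against y
# produces the full (len(x), len(y)) grid of pairs, and numexpr computes
# it block by block to stay cache-friendly
result = ne.evaluate("xc * yc", local_dict={"xc": x[:, None], "yc": y[None, :]})
```

Here `result` matches `np.outer(x, y)`; the point is that the blocking happens inside numexpr rather than in Python-level loops.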

Answer 1 (score: 1)

You are redefining ychunk_gen inside the xchunk loop. Perhaps the following will be faster:

chunk_size = 3
xchunk_gen = (x[i: i+chunk_size] for i in xrange(0, len(x), chunk_size))

def ychunk_gen(some_dependency_on_outer_loop):
    # use some_dependency_on_outer_loop
    for i in xrange(0, len(y), chunk_size):
        yield y[i: i+chunk_size]

for xchunk in xchunk_gen:
    for ychunk in ychunk_gen(chunk_or_something_else):
        for xletter in xchunk:
            for yletter in ychunk:
                # algebraic manipulations on x & y
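As a self-contained, runnable sketch of this pattern (dropping the placeholder argument and using a hypothetical chunk-sum in place of the real algebra): a generator *function* can be called again cheaply in each outer iteration instead of re-evaluating a generator expression:

```python
import numpy as np

chunk_size = 3
x = np.arange(6)
y = np.arange(6)

def ychunk_gen():
    # a generator function: each call cheaply restarts iteration over y
    for i in range(0, len(y), chunk_size):
        yield y[i:i + chunk_size]

total = 0
for xchunk in (x[i:i + chunk_size] for i in range(0, len(x), chunk_size)):
    for ychunk in ychunk_gen():
        # hypothetical stand-in for the algebraic manipulations
        total += int(xchunk.sum()) * int(ychunk.sum())

assert total == int(x.sum()) * int(y.sum())  # 15 * 15 == 225
```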

But perhaps the following is even better:

I assume x and y are numpy arrays, so you can reshape the arrays and then iterate over each row:

for xchunk in x.reshape((len(x)//chunk_size, chunk_size)):
    for ychunk in y.reshape((len(y)//chunk_size, chunk_size)):
        # the letter loops

From http://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html I read that if you want reshape not to copy the data, you should change the shape attribute of the data instead:

x.shape = len(x)//chunk_size, chunk_size 
y.shape = len(y)//chunk_size, chunk_size
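A small check of that claim (using np.shares_memory, which is not in the original answer): reshape on a contiguous array already returns a view, while assigning to the shape attribute mutates in place and raises an error rather than silently copying when a view is impossible:

```python
import numpy as np

chunk_size = 3
x = np.arange(12)

# reshape returns a view here (contiguous input), so no data is copied
x2d = x.reshape(len(x) // chunk_size, chunk_size)
assert np.shares_memory(x, x2d)

# assigning to .shape reshapes in place; if a copy were required,
# numpy would raise instead of copying behind your back
x.shape = len(x) // chunk_size, chunk_size
assert x.shape == (4, 3)
```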

Answer 2 (score: 0)

itertools.tee may give a modest time savings:

import itertools
import numpy as np

y = np.arange(1000)

def foo1(y):
    # create ygen each loop
    # py3 so range is a generator
    for j in range(100):
        ygen = (y[i:i+10] for i in range(0, 1000, 10))
        r = [x.sum() for x in ygen]
    return r

def foo3(y):
    # use tee to replicate the gen
    ygen = (y[i:i+10] for i in range(0, 1000, 10))
    ygens = itertools.tee(ygen, 100)
    for g in ygens:
        r = [x.sum() for x in g]
    return r

In [1123]: timeit foo3(y)
10 loops, best of 3: 108 ms per loop
In [1125]: timeit foo1(y)
10 loops, best of 3: 144 ms per loop

But based on

http://docs.cython.org/0.15/src/userguide/limitations.html#generators-and-generator-expressions

Since Cython 0.13, some generator expressions are supported when they can be transformed into inlined loops in combination with builtins, e.g. sum(x*2 for x in seq). As of 0.14, the supported builtins are list(), set(), dict(), sum(), any(), all(), sorted().
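In plain Python the supported form is just a builtin consuming a generator expression directly; this is the shape Cython 0.13+ can transform into an inlined loop:

```python
seq = [1, 2, 3, 4]

# a builtin applied directly to a generator expression — the pattern
# Cython can inline; here it is just ordinary Python
result = sum(x * 2 for x in seq)
assert result == 20
```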

I wonder what cython is doing with your chunked generator expressions.


Reshaping and iterating on rows doesn't help much with time.

def foo4(y):
    y2d = y.reshape(100, 10)
    for _ in range(100):
        r = [x.sum() for x in y2d]
    return r

is a bit slower than the teed generator. Of course relative timings like this could change with array size.