The speed bottleneck in my code is a double for loop over the elements of two arrays, x and y. A standard HPC trick for improving performance is to do the loop in chunks so that cache misses are minimized. I am trying to use Python generators to do the chunking, but needing to continually re-create the exhausted generator inside the outer for loop is killing my runtime.
Question:
Is there a smarter algorithm for constructing the appropriate generators to perform a chunked double-for loop?
Concrete illustration:
I will create two dummy arrays, x and y. I keep them short here for illustration, but in practice these are numpy arrays with ~1e6 elements.
import numpy as np

x = np.array(['a', 'b', 'b', 'c', 'c', 'd'])
y = np.array(['e', 'f', 'f', 'g'])
The naive double loop is simply:
for xletter in x:
    for yletter in y:
        # algebraic manipulations on x & y
Now let's use generators to perform this loop in chunks:
chunk_size = 3
xchunk_gen = (x[i: i+chunk_size] for i in range(0, len(x), chunk_size))
for xchunk in xchunk_gen:
    ychunk_gen = (y[i: i+chunk_size] for i in range(0, len(y), chunk_size))
    for ychunk in ychunk_gen:
        for xletter in xchunk:
            for yletter in ychunk:
                # algebraic manipulations on x & y
Notice that in order to implement a generator solution to this problem, I have to continually re-create ychunk_gen inside the outer loop. Since y is a large array, this kills my runtime (for ~1e6 elements, creating this generator takes about 20 ms on my laptop).
Is there a way I can be smarter about constructing my generators to get around this? Or is it necessary to abandon the generator solution entirely?
(Note: in practice I am using cython to perform this tight loop, but all of the above applies regardless.)
Answer 0 (score: 3)
It seems to me that the creation of your generator expression is killing your runtime because it is not optimized by cython.
A better solution, which keeps all the cache optimizations, is to use numexpr. Since the operations on x and y are algebraic, it should fit your constraints well (numexpr can do a bit more than that).
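As a sketch of this suggestion (assuming numexpr is installed; the numeric arrays and names xc/yc here are illustrative, not from the question), numexpr evaluates a whole expression string in cache-sized blocks internally, so the explicit chunked double loop disappears:

```python
import numpy as np
import numexpr as ne

x = np.arange(6, dtype=np.float64)
y = np.arange(4, dtype=np.float64)

# Broadcast x against y to form the full outer product; numexpr's virtual
# machine evaluates the expression blockwise, keeping the working set in cache.
outer = ne.evaluate("xc * yc", local_dict={"xc": x[:, None], "yc": y[None, :]})

assert np.allclose(outer, np.outer(x, y))
```

Any algebraic expression in x and y can be substituted for the string, as long as it uses operations numexpr supports.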
Answer 1 (score: 1)
You are redefining ychunk_gen inside the xchunk loop. Perhaps the following will be faster:
chunk_size = 3
xchunk_gen = (x[i: i+chunk_size] for i in xrange(0, len(x), chunk_size))

def ychunk_gen(some_dependency_on_outer_loop):
    # use some_dependency_on_outer_loop
    for i in xrange(0, len(y), chunk_size):
        yield y[i: i+chunk_size]
for xchunk in xchunk_gen:
    for ychunk in ychunk_gen(chunk_or_something_else):
        for xletter in xchunk:
            for yletter in ychunk:
                # algebraic manipulations on x & y
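A self-contained version of this idea (using Python 3's range, a hypothetical helper name chunks, and a placeholder sum in place of the algebraic manipulations) might look like:

```python
import numpy as np

def chunks(arr, size):
    # Yield successive views of arr, each at most `size` elements long
    for i in range(0, len(arr), size):
        yield arr[i:i + size]

x = np.arange(6)
y = np.arange(4)

total = 0
for xchunk in chunks(x, 3):
    # Calling the generator function again is cheap: it only creates a new
    # generator frame, it does not touch the array data.
    for ychunk in chunks(y, 3):
        # stand-in for the algebraic manipulations on x & y
        total += int(xchunk.sum()) * len(ychunk)
```

The point is that defining the generator function once and re-calling it per outer iteration avoids rebuilding a generator expression object each time.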
But perhaps there is an even better way:
I assume x and y are numpy arrays, so you can reshape the arrays and then loop over each row:
for xchunk in x.reshape((len(x)//chunk_size, chunk_size)):
    for ychunk in y.reshape((len(y)//chunk_size, chunk_size)):
        # the letter loops
At http://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html I read that if you want reshape not to copy the data, you should change the shape attribute of the array instead:
x.shape = len(x)//chunk_size, chunk_size
y.shape = len(y)//chunk_size, chunk_size
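A quick check of the view-vs-copy point (assuming len(x) is an exact multiple of chunk_size; a ragged tail would need separate handling):

```python
import numpy as np

chunk_size = 3
x = np.arange(12)

x2d = x.reshape(len(x) // chunk_size, chunk_size)
# For a contiguous array, reshape already returns a view sharing x's memory
assert x2d.base is x

# Iterating the 2-D view yields one chunk_size-long row (chunk) at a time
rows = [row for row in x2d]
assert rows[0].tolist() == [0, 1, 2]
```

So for a contiguous array no data is copied either way; assigning to .shape merely guarantees it by raising an error when a copy would be required.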
Answer 2 (score: 0)
itertools.tee may give a modest time savings:
import itertools
import numpy as np

y = np.arange(1000)

def foo1(y):
    # create ygen on each pass through the outer loop
    # py3, so range is a generator
    for j in range(100):
        ygen = (y[i:i+10] for i in range(0, 1000, 10))
        r = [x.sum() for x in ygen]
    return r

def foo3(y):
    # use tee to replicate the generator
    ygen = (y[i:i+10] for i in range(0, 1000, 10))
    ygens = itertools.tee(ygen, 100)
    for g in ygens:
        r = [x.sum() for x in g]
    return r
In [1123]: timeit foo3(y)
10 loops, best of 3: 108 ms per loop
In [1125]: timeit foo1(y)
10 loops, best of 3: 144 ms per loop
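For reference, a minimal illustration of what tee does here, independent of the timing code above:

```python
import itertools

gen = (i * i for i in range(5))
copies = itertools.tee(gen, 3)

# Each tee'd iterator replays the full sequence; the source generator
# is only advanced once, with tee buffering items for the later copies.
results = [list(g) for g in copies]
assert all(r == [0, 1, 4, 9, 16] for r in results)
```

Note that tee buffers everything the slowest copy has not yet consumed, so replicating a generator 100 times trades memory for the re-creation cost.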
But based on http://docs.cython.org/0.15/src/userguide/limitations.html#generators-and-generator-expressions:
"Since Cython 0.13, some generator expressions are supported when they can be transformed into inlined loops in combination with builtins, e.g. sum(x*2 for x in seq). As of 0.14, the supported builtins are list(), set(), dict(), sum(), any(), all(), sorted()."
I wonder what cython is doing with your chunked generator expressions.
Reshaping and iterating on rows doesn't help much with time.
def foo4(y):
    y2d = y.reshape(100, 10)
    for _ in range(100):
        r = [x.sum() for x in y2d]
    return r
is a bit slower than the teed generator. Of course relative timings like this could change with array size.