How can I optimize a for loop over successive values using NumPy?

Time: 2018-09-21 04:04:25

Tags: python numpy vectorization

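I am trying to write a function that returns a numpy.array of n pseudo-random numbers uniformly distributed between 0 and 1. Details can be found here: {{3 }}

So far it works well. The only problem is that each new value is computed from the previous one, so the only solution I have found so far uses a loop. To make it more efficient I would like to get rid of the loop, possibly by vectorizing the operation, but I don't know how to do that.

Do you have any suggestions on how to optimize this function?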

import numpy as np
import time

def unif(n):
    m = 2**32
    a = 1664525
    c = 1013904223

    result = np.empty(n)
    result[0] = int((time.time() * 1e7) % m)

    for i in range(1,n):
        result[i] = (a*result[i-1]+c) % m

    return result / m

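3 Answers:

Answer 0 (score: 3)

Although it is not vectorized, I believe the following solution is about 2x faster (60x faster with the Numba version). It keeps each result in a local variable instead of reading it back from the numpy array by position.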

def unif_improved(n):
    m = 2**32
    a = 1664525
    c = 1013904223

    results = np.empty(n)
    results[0] = result = int((time.time() * 1e7) % m)

    for i in range(1, n):
        result = results[i] = (a * result + c) % m

    return results / m
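To check that the local-variable rewrite leaves the output unchanged, both loops can take the seed as a parameter and be compared directly. This is a minimal sketch: passing the seed in is an assumption made for reproducibility (the functions above read the clock), and the names `lcg_indexed` and `lcg_local` are hypothetical.

```python
import numpy as np

M = 2**32
A = 1664525
C = 1013904223

def lcg_indexed(n, seed):
    """Loop that reads the previous value back from the array (as in the question)."""
    out = np.empty(n)
    out[0] = seed
    for i in range(1, n):
        out[i] = (A * out[i - 1] + C) % M
    return out / M

def lcg_local(n, seed):
    """Loop that carries the state in a local Python int (as in unif_improved)."""
    out = np.empty(n)
    out[0] = state = seed
    for i in range(1, n):
        state = out[i] = (A * state + C) % M
    return out / M

# Fixed seed, chosen only for illustration
same = np.array_equal(lcg_indexed(1000, 12345), lcg_local(1000, 12345))
print(same)
```

The two agree because every intermediate value stays below 2**53, so the float64 arithmetic in the indexed loop is still exact.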

You could also consider using Numba for more speed: https://numba.pydata.org/

Simply adding the @jit decorator blows the other solutions away.

from numba import jit

# Same code as `unif_improved`: wrapping it is equivalent to decorating it with @jit
unif_jit = jit(unif_improved)

Timings

>>> %timeit -n 10 unif_original(500000)
715 ms ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit -n 10 unif_improved(500000)
323 ms ± 8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit -n 10 unif_jit(500000)
12 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

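Answer 1 (score: 2)

This cannot be done completely, because the answers depend on one another sequentially. The magic of modular arithmetic does mean that you can get a small improvement with the following change (adapted from @Alexander's suggestion of using a local variable instead of an array lookup).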

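def unif_2(n):
    m = 2**32
    a = 1664525
    c = 1013904223
    mask = 2**64 - 1  # wrap at 64 bits; m divides 2**64, so values mod m are unchanged

    # uint64 storage keeps every stored value exact (float64 is exact only up to 2**53)
    results = np.empty(n, dtype=np.uint64)
    results[0] = result = int((time.time() * 1e7) % m)

    for i in range(1, n):
        result = results[i] = (a * result + c) & mask

    # single vectorized reduction modulo m at the end
    return results % m / m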
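The identity this answer relies on is that (a*x + c) mod m depends only on x mod m, so the reduction can be postponed as long as the unreduced values are kept exact. The following is a small check with Python integers, which have arbitrary precision; the function names and the fixed seed are chosen here only for illustration.

```python
M = 2**32
A = 1664525
C = 1013904223

def lcg_reduced(n, seed):
    """Reduce modulo M at every step."""
    x, out = seed, [seed]
    for _ in range(n - 1):
        x = (A * x + C) % M
        out.append(x)
    return out

def lcg_deferred(n, seed):
    """Let the state grow without bound and reduce only at the end."""
    x, out = seed, [seed]
    for _ in range(n - 1):
        x = A * x + C  # exact: Python ints never overflow
        out.append(x)
    return [v % M for v in out]

assert lcg_reduced(50, 12345) == lcg_deferred(50, 12345)
```

The unreduced state gains roughly 21 bits per step (log2 of a), so the deferral is only exact in arbitrary-precision integers or in an unsigned type whose wraparound modulus is a multiple of m. A float64 value is exact only up to 2**53.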

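Answer 2 (score: 2)

Update:

Exploiting the fact that the modulus is 2^32, we can eliminate all Python loops and get a speedup of ~91.1x.

Allowing any modulus, we can still reduce the linear-length loop to a logarithmic-length loop. For 500,000 samples this gives a speedup of ~17.1x. If we precompute the multi-step factors and offsets (they are the same for any seed), this rises to ~44.8x.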
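Both speedups follow from the closed form of the recurrence. Writing $x_{i+1} = (a x_i + c) \bmod m$ for the generator in the question, k steps collapse into a single affine map, and two such maps compose into another one. The symbols $A_k$ and $C_k$ are introduced here for illustration and do not appear in the original answer.

```latex
\begin{aligned}
x_k &\equiv A_k x_0 + C_k \pmod{m}, \qquad A_k = a^k, \quad C_k = c \sum_{j=0}^{k-1} a^j, \\
A_{p+j} &\equiv A_p A_j \pmod{m}, \qquad C_{p+j} \equiv A_p C_j + C_p \pmod{m}.
\end{aligned}
```

The second line is what allows a table of the first p factors and offsets to be doubled to 2p entries in one vectorized step, as `precomp` does, so the loop runs about log2(n) times. The first line is what `unif_32` evaluates directly with `cumprod` and `cumsum`, relying on 32-bit unsigned wraparound to perform the reduction modulo 2^32.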

Code:

import numpy as np
import time

def unif(n, seed):
    m = 2**32
    a = 1664525
    c = 1013904223

    result = np.empty(n)
    result[0] = seed

    for i in range(1,n):
        result[i] = (a*result[i-1]+c) % m

    return result / m

def precomp(n):
    l = n.bit_length()
    a, c = np.empty((2, 1+(1<<l)), np.uint64)
    m = 2**32
    a[:2] = 1, 1664525
    c[:2] = 0, 1013904223

    p = 1
    for j in range(l):
        a[1+p:1+(p<<1)] = a[p] * a[1:1+p] % m
        c[1+p:1+(p<<1)] = (a[p] * c[1:1+p] + c[p]) % m
        p <<= 1

    return a, c

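def unif_opt(n, seed, a=None, c=None):
    m = 2**32  # defined locally so the function does not depend on a global
    if a is None:
        a, c = precomp(n)
    return (seed * a[:n] + c[:n]) % m / m

def unif_32(n, seed):
    m = 2**32
    out = np.empty((n,), np.uint32)
    out[0] = 1
    # out[k] = a**k mod 2**32 (uint32 arithmetic wraps around)
    np.broadcast_to(np.uint32(1664525), (n-1,)).cumprod(out=out[1:])
    # c[k-1] = sum of a**j for j < k, mod 2**32
    c = out[:-1].cumsum(dtype=np.uint32)
    c *= 1013904223
    out *= seed
    out[1:] += c
    # float divisor, so the division is carried out in float64
    return out / float(m)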

m = 2**32
seed = int((time.time() * 1e7) % m)
n = 500000
a, c = precomp(n)

print('results equal:', np.allclose(unif(n, seed), unif_opt(n, seed)) and 
      np.allclose(unif_opt(n, seed), unif_opt(n, seed, a, c)) and
      np.allclose(unif_32(n, seed), unif_opt(n, seed, a, c)))

from timeit import timeit

t = timeit('unif(n, seed)', globals=globals(), number=10)
t_opt = timeit('unif_opt(n, seed)', globals=globals(), number=10)
t_prc = timeit('unif_opt(n, seed, a, c)', globals=globals(), number=10)
t_32 = timeit('unif_32(n, seed)', globals=globals(), number=10)
print(f'speedup without precomp: {t/t_opt:.1f}')
print(f'speedup with precomp:    {t/t_prc:.1f}')
print(f'speedup special case:    {t/t_32:.1f}')

Sample run:

results equal: True
speedup without precomp: 17.1
speedup with precomp:    44.8
speedup special case:    91.1
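The same factor/offset composition also gives random access into the sequence: the pair for k steps can be built by repeated doubling in O(log k) integer operations, without storing a table. The following is a minimal pure-Python sketch; `lcg_step` and `lcg_jump` are hypothetical helper names, not part of the code above.

```python
M = 2**32
A = 1664525
C = 1013904223

def lcg_step(x):
    """One step of the generator on the integer state."""
    return (A * x + C) % M

def lcg_jump(k):
    """Return (A_k, C_k) such that k steps map x to (A_k * x + C_k) % M."""
    acc_a, acc_c = 1, 0  # identity map: zero steps
    cur_a, cur_c = A, C  # map for 2**bit steps, starting at one step
    while k:
        if k & 1:
            # compose: apply the accumulated map first, then cur
            acc_a, acc_c = (cur_a * acc_a) % M, (cur_a * acc_c + cur_c) % M
        # square cur: the map for twice as many steps
        cur_a, cur_c = (cur_a * cur_a) % M, (cur_a * cur_c + cur_c) % M
        k >>= 1
    return acc_a, acc_c

# Compare against 1000 sequential steps from a fixed, illustrative seed
seed = 12345
x = seed
for _ in range(1000):
    x = lcg_step(x)
a_k, c_k = lcg_jump(1000)
assert (a_k * seed + c_k) % M == x
```

This is the scalar counterpart of `precomp`: instead of doubling a whole table, it doubles a single (factor, offset) pair and folds it into the result whenever the corresponding bit of k is set.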