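I have many small numpy arrays (groups) of different sizes, and I want to concatenate an arbitrary subset of these groups as fast as possible. The first solution I came up with was to store the groups as an np.array of np.arrays and then access a subset of the groups with list indexing: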
import numpy as np
groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size=size))
groups = np.array(groups, dtype=object)  # ragged input needs an explicit object dtype
indices = np.random.randint(len(groups), size=1000)
%%timeit
np.concatenate(groups[indices])
>>> 204 µs ± 395 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
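However, because the groups are very small (2 elements on average), this solution is inefficient in terms of memory: I have to store a numpy array structure for every group, which is almost 100 bytes each (too much for me).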
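A rough way to check this per-array overhead is sketched below. It assumes the array owns its data buffer, and the exact number depends on the NumPy version and platform, so no expected value is shown:

```python
import sys
import numpy as np

group = np.array([5, 7], dtype=np.int64)   # a typical 2-element group
payload = group.nbytes                     # bytes used by the actual values
total = sys.getsizeof(group)               # array object plus its own data buffer
overhead = total - payload                 # bookkeeping bytes per array
print(f"payload={payload} B, overhead={overhead} B")
```

The object array that holds the groups adds one more pointer per group on top of this.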
To make the solution more memory-efficient, I decided to concatenate all the groups and store the array boundaries in a separate array:
data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups])
# ith group is data[offsets[i]: offsets[i + 1]]
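As a small illustration of this layout, here is a sketch with three made-up groups; the helper name get_group is only for illustration:

```python
import numpy as np

groups = [np.array([10, 11]), np.array([20]), np.array([30, 31, 32])]
data = np.concatenate(groups)                          # [10 11 20 30 31 32]
offsets = np.cumsum([0] + [len(g) for g in groups])    # [0 2 3 6]

def get_group(i):
    """Return the i-th group as a view into the flat data array."""
    return data[offsets[i]: offsets[i + 1]]

print(get_group(2))
```

Each group is now just a pair of neighbouring entries in offsets, so the per-group cost is one integer instead of a full array object.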
However, concatenation is then not obvious at all. Something like this:
%%timeit
np.concatenate([data[offsets[i]: offsets[i + 1]] for i in indices])
>>> 1.02 ms ± 44.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
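runs 5 times slower than the original solution. I think this is due to two things. First, iterating over the numpy array of indices (Python boxes the C integer into an object for each index). Second, Python creates a numpy structure for every slice/index. I don't think the concatenation time can be reduced in pure Python for this layout, so I decided to come up with a Cython solution.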
%%cython
import numpy as np
ctypedef long long int64
def concatenate(data, offsets, indices):
    cdef int64[::] data_view = data
    cdef int64[::] indices_view = indices
    cdef int64[::] offsets_view = offsets
    # Total length of the result: sum of the selected group sizes
    size = np.sum(offsets[indices + 1]) - np.sum(offsets[indices])
    res = np.zeros(size, dtype=np.int64)
    cdef int64[::] res_view = res
    cdef int64 i, l = 0, r
    for i in indices_view:
        r = l + offsets_view[i + 1] - offsets_view[i]
        res_view[l: r] = data_view[offsets_view[i]: offsets_view[i + 1]]
        l = r
    return res
%%timeit
concatenate(data, offsets, indices)
>>> 277 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
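This solution is faster than the previous one, but still a bit slower than the original. The biggest problem, though, is that I don't know the data type in advance. I used int64 in the example, but it could be any numeric type, e.g. float32, so I cannot use typed memoryviews as I did above. In theory I only need to know the size of the type (4/8 bytes), and if I had pointers to the data and result arrays I could copy the slices with memcpy or something similar. But I don't know how to do this in Cython. Is there a way?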
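To make the idea concrete, here is a minimal pure-NumPy sketch (not Cython, and still slow because of the Python loop) that copies groups as raw bytes, so only the item size is used. The function name concatenate_bytes is just illustrative, and it assumes 1-D data and integer offsets and indices:

```python
import numpy as np

def concatenate_bytes(data, offsets, indices):
    """Copy the selected groups using only the item size, not the dtype."""
    data = np.ascontiguousarray(data)
    itemsize = data.itemsize
    raw = data.view(np.uint8)                      # same memory, seen as bytes
    starts = offsets[indices] * itemsize           # byte offset of each group
    sizes = (offsets[indices + 1] - offsets[indices]) * itemsize
    out = np.empty(int(sizes.sum()), dtype=np.uint8)
    pos = 0
    for start, size in zip(starts.tolist(), sizes.tolist()):
        out[pos:pos + size] = raw[start:start + size]   # stand-in for memcpy
        pos += size
    return out.view(data.dtype)                    # reinterpret bytes as the original dtype

data32 = np.array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5], dtype=np.float32)
offs = np.array([0, 2, 3, 6])
picked = concatenate_bytes(data32, offs, np.array([2, 0]))
print(picked, picked.dtype)
```

What I am looking for is the same byte-level copy, but done inside Cython without the per-group Python overhead.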
Answer 0 (score: 1)
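Here is my pure numpy-only solution, the adv_concatenate() function. It gives a 15x-47x speedup compared to the regular np.concatenate() (varies across machines).
Note: after the first code there is a second, even faster solution.
Timings are measured with the pip module timerit; install it once via python -m pip install timerit. Two kinds of machines were used for the timings. The first is Windows-based and is the same for all tests; it is my home laptop. The second is Linux-based and is a different machine for each test (so speeds differ between tests, but stay the same within a single run/test); these are machines of the repl.it site, which I also used to test the code.
The idea of the algorithm is to use numpy's cumulative sum function (.cumsum()):
1. Create an array of 1s whose size equals the total size of the resulting concatenated data array. This array will hold the indices of all data elements to be fetched to create the resulting data array.
2. Write the start offset of the first group into element 0 of the index array. After cumsum() this start value becomes the starting offset into the data array. The remaining values are still 1s.
3. At each position where the next group starts, write the jump from the end of the previous group to the start of the next one, begs[1:] - ends[:-1] + 1.
4. Apply .cumsum(). Now all values hold the correct indices of the data elements to be fetched.
5. Index data with the index array above to form the fetched data array.
The algorithm can be made even faster if we precompute values such as offsets[1:] - offsets[:-1] or offsets[:-1] + 1 and use them in the adv_concatenate() function.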
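The steps above can be traced on a tiny input. This is only a sketch with made-up values; it assumes every selected group is non-empty, which holds for the test data here:

```python
import numpy as np

data = np.array([10, 11, 20, 30, 31, 32])   # groups: [10, 11], [20], [30, 31, 32]
offsets = np.array([0, 2, 3, 6])
indices = np.array([2, 0])                  # take group 2, then group 0

begs, ends = offsets[indices], offsets[indices + 1]   # [3, 0] and [6, 2]
clens = (ends - begs).cumsum()                        # [3, 5]
ix = np.ones(clens[-1], dtype=offsets.dtype)          # [1, 1, 1, 1, 1]
ix[0] = begs[0]                                       # [3, 1, 1, 1, 1]
ix[clens[:-1]] = begs[1:] - ends[:-1] + 1             # [3, 1, 1, -5, 1]
ix = ix.cumsum()                                      # [3, 4, 5, 0, 1]
result = data[ix]                                     # [30, 31, 32, 10, 11]
print(ix, result)
```

The single negative entry is the jump from the end of group 2 (offset 6) back to the start of group 0 (offset 0); everything else is a step of 1 inside a group.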
# Needs: python -m pip install numpy timerit
from timerit import Timerit
import numpy as np
np.random.seed(0)
Timerit._default_asciimode = True
groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size))
groups = np.array(groups, dtype = object) # ragged input needs an explicit object dtype
indices = np.random.randint(len(groups), size = 1000)
data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups])
timer = lambda: Timerit(num = 600, verbose = 1)
print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()
def adv_concatenate(data, offsets, indices):
    # Start and end offsets of every selected group (groups must be non-empty)
    begs, ends = offsets[indices], offsets[indices + 1]
    lens = ends - begs
    clens = lens.cumsum()
    # Array of 1s; group starts are patched so that cumsum yields data indices
    ix = np.ones((clens[-1],), dtype = offsets.dtype)
    ix[0] = begs[0]
    ix[clens[:-1]] = begs[1:] - ends[:-1] + 1
    ix = ix.cumsum()
    return data[ix]
print('adv_concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv = adv_concatenate(data, offsets, indices)
tadv = tim.mean()
assert np.array_equal(ref, adv) # Check that our solution is correct
print('speedup:', round(tref / tadv, 3))
Output:
On the first machine (Windows):
np.concatenate(): Timed best=3.129 ms, mean=3.225 +- 0.1 ms
adv_concatenate(): Timed best=191.137 us, mean=208.012 +- 20.7 us
speedup: 15.504
On the second machine (Linux):
np.concatenate(): Timed best=1.666 ms, mean=2.314 +- 0.4 ms
adv_concatenate(): Timed best=35.596 us, mean=48.680 +- 15.4 us
speedup: 47.532
The second solution is even faster than the first: a 40x-150x speedup compared to the regular np.concatenate() (varies across machines). But the second solution uses Numba, an LLVM-based JIT compiler, which needs to be installed via python -m pip install numba.
Although it uses the extra numba package, the central function adv_concatenate_indexes_numba() is very simple, with about as many lines of code as the first solution. The algorithm is also very simple, just two plain loops.
The current solution works for any data type, because the central function only computes the resulting indices and therefore does not touch data's dtype at all. It can also be sped up by a further 10%-90% if, instead of computing indices, the numba function computes the resulting data array directly, but this works only for the very simple data types supported by numba, which include all numeric types. Here is the code (or here) for this improved solution, which achieves up to a 250x speedup! Timings for this improved version on the second machine (Linux):
np.concatenate(): Timed best=1.640 ms, mean=3.403 +- 1.9 ms
adv_concatenate_numba(): Timed best=12.669 us, mean=17.235 +- 6.9 us
speedup: 197.46
Next is the code for the more generic (index-only computation) solution:
# Needs: python -m pip install numpy numba timerit
from timerit import Timerit
import numpy as np, numba
np.random.seed(0)
Timerit._default_asciimode = True
groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
groups = np.array(groups, dtype = object) # ragged input needs an explicit object dtype
indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)
data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)
timer = lambda: Timerit(num = 600, verbose = 1)
print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()
@numba.njit('i8[:](i8[:], i8[:])', cache = True)
def adv_concatenate_indexes_numba(offsets, indices):
    # First loop: total length of the result
    tlen = 0
    for i in range(indices.size):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]
    # Second loop: write the data index of every element to fetch
    pos, r = 0, np.empty((tlen,), dtype = offsets.dtype)
    for i in range(indices.size):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            r[pos] = j
            pos += 1
    return r
def adv_concatenate2(data, offsets, indices):
    return data[adv_concatenate_indexes_numba(offsets, indices)]
adv_concatenate2(data, offsets, indices) # Call once so Numba compiles before timing
print('adv_concatenate2(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv = adv_concatenate2(data, offsets, indices)
tadv = tim.mean()
assert np.array_equal(ref, adv) # Check that our solution is correct
print('speedup:', round(tref / tadv, 3))
Output:
On the first machine (Windows):
np.concatenate(): Timed best=3.201 ms, mean=3.356 +- 0.1 ms
adv_concatenate2(): Timed best=79.681 us, mean=82.991 +- 6.7 us
speedup: 40.442
On the second machine (Linux):
np.concatenate(): Timed best=1.541 ms, mean=2.220 +- 0.7 ms
adv_concatenate2(): Timed best=12.012 us, mean=14.830 +- 4.8 us
speedup: 149.716
Inspired by the Cython code in @pavelgramovich's answer, I also decided to implement a simplified version with plain loops (func concatenate1()) instead of the memcpy() version (func concatenate0()). For the current test data the simplified version turned out to be 1.5-2x faster than the memcpy version. The full code comparing these two versions follows:
# Needs: python -m pip install numpy timerit cython setuptools
from timerit import Timerit
import numpy as np
np.random.seed(0)
Timerit._default_asciimode = True
groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
groups = np.array(groups, dtype = object) # ragged input needs an explicit object dtype
indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)
data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)
timer = lambda: Timerit(num = 600, verbose = 1)
def compile_cy_cats():
    src = """
import numpy as np
cimport numpy as np
cimport cython
from libc.string cimport memcpy
@cython.boundscheck(False)
@cython.wraparound(False)
def concatenate0(np.ndarray data, np.ndarray offsets, np.ndarray indices):
    data = np.ascontiguousarray(data)
    start_offsets = np.ascontiguousarray(offsets[indices], dtype=np.int64)
    end_offsets = np.ascontiguousarray(offsets[indices + 1], dtype=np.int64)
    cdef np.int64_t[::1] coffsets = start_offsets
    cdef np.int64_t[::1] csizes = end_offsets - start_offsets
    cdef np.int64_t i, total_size = 0
    for i in range(csizes.shape[0]):
        total_size += csizes[i]
    res = np.empty(total_size, dtype=data.dtype)
    cdef np.ndarray cdata = data
    cdef np.ndarray cres = res
    cdef np.int64_t itemsize = data.itemsize
    cdef np.int64_t res_offset = 0
    for i in range(csizes.shape[0]):
        memcpy(cres.data + res_offset * itemsize,
               cdata.data + coffsets[i] * itemsize,
               csizes[i] * itemsize)
        res_offset += csizes[i]
    return res
@cython.boundscheck(False)
@cython.wraparound(False)
def concatenate1(np.int64_t[:] data, np.int64_t[:] offsets, np.int64_t[:] indices):
    cdef np.int64_t tlen = 0, pos = 0, ix = 0, ixs = indices.size, i = 0, j = 0
    for i in range(ixs):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]
    r = np.empty(tlen, dtype = np.int64)
    cdef np.int64_t[:] cr = r, cdata = data
    for i in range(ixs):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            cr[pos] = cdata[j]
            pos += 1
    return r
"""
    srcb = src.encode('utf-8')
    import hashlib, os, glob, importlib
    srch = hashlib.sha256(srcb).hexdigest().upper()[:8]
    # Build only if no compiled module for this exact source exists yet
    if len(glob.glob(f'cy{srch}*')) == 0:
        with open(f'cys{srch}.pyx', 'wb') as f:
            f.write(srcb)
        import sys
        from setuptools import setup, Extension
        from Cython.Build import cythonize
        import numpy as np
        sys.argv += ['build_ext', '--inplace']
        setup(
            ext_modules = cythonize(
                Extension(f'cy{srch}', [f'cys{srch}.pyx']), language_level = 3, annotate = True,
            ),
            include_dirs = [np.get_include()],
        )
        del sys.argv[-2:]
    print('Cython module:', f'cy{srch}')
    return importlib.import_module(f'cy{srch}')
cy_cats = compile_cy_cats()
concatenate0, concatenate1 = cy_cats.concatenate0, cy_cats.concatenate1
print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()
concatenate0(data, offsets, indices) # Maybe pre-heat
print('cy_concatenate0(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv0 = concatenate0(data, offsets, indices)
tadv0 = tim.mean()
assert np.array_equal(ref, adv0) # Check that our solution is correct
print('speedup:', round(tref / tadv0, 3))
concatenate1(data, offsets, indices) # Maybe pre-heat
print('cy_concatenate1(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv1 = concatenate1(data, offsets, indices)
tadv1 = tim.mean()
assert np.array_equal(ref, adv1) # Check that our solution is correct
print('speedup:', round(tref / tadv1, 3))
Output:
First machine (Windows):
Cython module: cy0BEBA0C8
np.concatenate(): Timed best=3.184 ms, mean=3.263 +- 0.1 ms
cy_concatenate0(): Timed best=119.767 us, mean=128.688 +- 10.7 us
speedup: 25.354
cy_concatenate1(): Timed best=86.525 us, mean=93.699 +- 20.5 us
speedup: 34.821
Second machine (Linux):