Is there a way to calculate many histograms along an axis of an nD array? The method I currently use iterates over all the other axes with a for loop and computes numpy.histogram() for each resulting 1D array:
import numpy
import itertools

data = numpy.random.rand(4, 5, 6)

# axis=-1, place `200001` and `[slice(None)]` on any other position to process along other axes
out = numpy.zeros((4, 5, 200001), dtype="int64")
indices = [
    numpy.arange(4), numpy.arange(5), [slice(None)]
]

# Iterate over all axes, calculate histogram for each cell
for idx in itertools.product(*indices):
    out[idx] = numpy.histogram(
        data[idx],
        bins=2 * 100000 + 1,
        range=(-100000 - 0.5, 100000 + 0.5),
    )[0]

out.shape  # (4, 5, 200001)
Needless to say, this is very slow. However, I couldn't find a way to solve this using numpy.histogram or numpy.histogram2d.
Answer 0 (score: 6)
Here's a vectorized approach that makes use of the efficient tools np.searchsorted and np.bincount: searchsorted gives us the bin in which each element belongs, and bincount does the counting for us.
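To see the two tools on a single 1D array first, here is a minimal sketch (the values are illustrative, not part of the implementation below):

import numpy as np

x = np.array([0.1, 0.5, 0.9, 1.5, -0.2])
bins = np.linspace(0.0, 1.0, 6)            # 5 equal-width bins on [0, 1]

# searchsorted: bin id per element; out-of-range values land at -1 or 5
idx = np.searchsorted(bins, x, 'right') - 1
valid = (idx >= 0) & (idx < 5)             # note: a value exactly on the right
                                           # edge (1.0) is dropped here, unlike
                                           # np.histogram's inclusive last bin
# bincount: counts per bin id
counts = np.bincount(idx[valid], minlength=5)
print(counts)                                            # [1 0 1 0 1]
print(np.histogram(x, bins=5, range=(0.0, 1.0))[0])      # [1 0 1 0 1]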
Implementation -
import numpy as np

def hist_laxis(data, n_bins, range_limits):
    # Setup bins and determine the bin location for each element for the bins
    R = range_limits
    N = data.shape[-1]
    bins = np.linspace(R[0], R[1], n_bins + 1)
    data2D = data.reshape(-1, N)
    idx = np.searchsorted(bins, data2D, 'right') - 1

    # Some elements would be off limits, so get a mask for those
    bad_mask = (idx == -1) | (idx == n_bins)

    # We need to use bincount to get bin based counts. To have unique IDs for
    # each row and not get confused by the ones from other rows, we need to
    # offset each row by a scale (using row length for this).
    scaled_idx = n_bins * np.arange(data2D.shape[0])[:, None] + idx

    # Set the bad ones to be last possible index+1 : n_bins*data2D.shape[0]
    limit = n_bins * data2D.shape[0]
    scaled_idx[bad_mask] = limit

    # Get the counts and reshape to multi-dim
    counts = np.bincount(scaled_idx.ravel(), minlength=limit + 1)[:-1]
    counts.shape = data.shape[:-1] + (n_bins,)
    return counts
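hist_laxis bins along the last axis only; to histogram along any other axis, one can move that axis to the end first, e.g. with np.moveaxis (a usage sketch, not part of the original answer):

import numpy as np

data = np.random.rand(4, 5, 6)

# Histogram along axis 1: move it last, then reuse hist_laxis as-is.
counts = hist_laxis(np.moveaxis(data, 1, -1), n_bins=10, range_limits=(0, 1))
counts.shape   # (4, 6, 10): one 10-bin histogram per (axis-0, axis-2) cell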
Runtime test
Original approach -
import itertools

def org_app(data, n_bins, range_limits):
    R = range_limits
    m, n = data.shape[:2]
    out = np.zeros((m, n, n_bins), dtype="int64")
    indices = [
        np.arange(m), np.arange(n), [slice(None)]
    ]

    # Iterate over all axes, calculate histogram for each cell
    for idx in itertools.product(*indices):
        out[idx] = np.histogram(
            data[idx],
            bins=n_bins,
            range=(R[0], R[1]),
        )[0]
    return out
Timings and verification -
In [2]: data = np.random.randn(4, 5, 6)
...: out1 = org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5))
...: out2 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
...: print np.allclose(out1, out2)
...:
True
In [3]: %timeit org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5))
10 loops, best of 3: 39.3 ms per loop
In [4]: %timeit hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
100 loops, best of 3: 3.17 ms per loop
Since the loopy solution loops over the first two axes, let's increase their lengths, as that will show us how good the vectorized approach really is -
In [59]: data = np.random.randn(400, 500, 6)
In [60]: %timeit org_app(data, n_bins=21, range_limits=(- 2.5, 2.5))
1 loops, best of 3: 9.59 s per loop
In [61]: %timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
10 loops, best of 3: 44.2 ms per loop
In [62]: 9590/44.2 # Speedup number
Out[62]: 216.9683257918552
Answer 1 (score: 1)
The first solution provides a nice, short idiom using numpy's searchsorted, which comes at the cost of a sort and many searches. But numpy has a fast route in its source code (actually implemented in Python) that can handle equal-width bin edge ranges mathematically: that route uses only vectorized subtraction and multiplication plus some comparisons instead. The solution below follows the numpy code for what searchsorted did above and for type resolution, and it handles weights as well as complex numbers. Basically, it is the first solution combined with numpy's histogram fast route, plus some extra type and iteration details.
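The core of that fast route, stripped down to one flat array, looks roughly like this (an illustrative sketch of the equal-width trick, not numpy's actual source):

import numpy as np

x = np.random.rand(1000)
lo, hi, bins = 0.0, 1.0, 10

# For equal-width bins the index is plain arithmetic: no sort, no search.
idx = ((x - lo) * (bins / (hi - lo))).astype(np.intp)
idx[idx == bins] -= 1    # values exactly on the right edge go in the last bin

# Should print True for this data; numpy's real code additionally applies
# the ~1 ULP edge corrections seen in the full implementation below.
print(np.array_equal(np.bincount(idx, minlength=bins),
                     np.histogram(x, bins=bins, range=(lo, hi))[0]))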
_range = range   # keep a reference: the function's `range` parameter shadows the builtin

def hist_np_laxis(a, bins=10, range=None, weights=None):
    # Initialize empty histogram
    N = a.shape[-1]
    data2D = a.reshape(-1, N)
    limit = bins * data2D.shape[0]
    # Reshape weights to match the flattened 2D data layout row for row.
    if weights is not None:
        weights = weights.reshape(-1, N)
    # gh-10322 means that type resolution rules are dependent on array
    # shapes. To avoid this causing problems, we pick a type now and stick
    # with it throughout.
    bin_type = np.result_type(range[0], range[1], a)
    if np.issubdtype(bin_type, np.integer):
        bin_type = np.result_type(bin_type, float)
    bin_edges = np.linspace(range[0], range[1], bins + 1,
                            endpoint=True, dtype=bin_type)
    # Histogram is an integer or a float array depending on the weights.
    if weights is None:
        ntype = np.dtype(np.intp)
    else:
        ntype = weights.dtype
    n = np.zeros(limit, ntype)
    # Pre-compute histogram scaling factor
    norm = bins / (range[1] - range[0])
    # We set a block size, as this allows us to iterate over chunks when
    # computing histograms, to minimize memory usage.
    BLOCK = 65536
    # We iterate over blocks here for two reasons: the first is that for
    # large arrays, it is actually faster (for example for a 10^8 array it
    # is 2x as fast) and it results in a memory footprint 3x lower in the
    # limit of large arrays.
    for i in _range(0, data2D.shape[0], BLOCK):
        tmp_a = data2D[i:i+BLOCK]
        block_size = tmp_a.shape[0]
        if weights is None:
            tmp_w = None
        else:
            tmp_w = weights[i:i + BLOCK]
        # Only include values in the right range
        keep = (tmp_a >= range[0])
        keep &= (tmp_a <= range[1])
        if not np.logical_and.reduce(np.logical_and.reduce(keep)):
            tmp_a = tmp_a[keep]
            if tmp_w is not None:
                tmp_w = tmp_w[keep]
        # This cast ensures no type promotions occur below, which gh-10322
        # make unpredictable. Getting it wrong leads to precision errors
        # like gh-8123.
        tmp_a = tmp_a.astype(bin_edges.dtype, copy=False)
        # Compute the bin indices, and for values that lie exactly on
        # last_edge we need to subtract one
        f_indices = (tmp_a - range[0]) * norm
        indices = f_indices.astype(np.intp)
        indices[indices == bins] -= 1
        # The index computation is not guaranteed to give exactly
        # consistent results within ~1 ULP of the bin edges.
        decrement = tmp_a < bin_edges[indices]
        indices[decrement] -= 1
        # The last bin includes the right edge. The other bins do not.
        increment = ((tmp_a >= bin_edges[indices + 1])
                     & (indices != bins - 1))
        indices[increment] += 1
        # Offset each row's bin indices so every row gets its own block of
        # `bins` slots in the flat output, then flatten for bincount.
        scaled_idx = ((bins * np.arange(i, i + block_size)[:, None] * keep)[keep]
                      .reshape(indices.shape) + indices)
        indices = scaled_idx.reshape(-1)
        # tmp_w is still 2D when no out-of-range filtering happened above;
        # flatten it to line up with the flattened indices.
        if tmp_w is not None:
            tmp_w = tmp_w.reshape(-1)
        # We now compute the histogram using bincount
        if ntype.kind == 'c':
            n.real += np.bincount(indices, weights=tmp_w.real,
                                  minlength=limit)
            n.imag += np.bincount(indices, weights=tmp_w.imag,
                                  minlength=limit)
        else:
            n += np.bincount(indices, weights=tmp_w,
                             minlength=limit).astype(ntype)
    n.shape = a.shape[:-1] + (bins,)
    return n
data = np.random.randn(4, 5, 6)
out1 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5))
out2 = hist_np_laxis(data, bins=200001, range=(- 2.5, 2.5))
print(np.allclose(out1, out2))
True
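Unlike hist_laxis, hist_np_laxis also accepts weights. A quick check against per-row numpy.histogram calls (a usage sketch with illustrative shapes, relying on the weights-reshape fix in the function above):

import numpy as np

data = np.random.randn(3, 4, 50)
w = np.random.rand(3, 4, 50)

out = hist_np_laxis(data, bins=10, range=(-2.5, 2.5), weights=w)
ref = np.array([[np.histogram(data[i, j], bins=10, range=(-2.5, 2.5),
                              weights=w[i, j])[0]
                 for j in range(4)]
                for i in range(3)])
print(np.allclose(out, ref))   # True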
%timeit hist_np_laxis(data, bins=21, range=(- 2.5, 2.5))
92.1 µs ± 504 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
55.1 µs ± 3.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The first solution is faster on the small example, though, and even on a larger one:
data = np.random.randn(400, 500, 6)
%timeit hist_np_laxis(data, bins=21, range=(- 2.5, 2.5))
264 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5))
71.6 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is not always faster, however:
data = np.random.randn(400, 6, 500)
%timeit hist_np_laxis(data, bins=101, range=(- 2.5, 2.5))
71.5 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit hist_laxis(data, n_bins=101, range_limits=(- 2.5, 2.5))
76.9 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
However, the numpy variation is only faster when the last axis is large, and the speedup is very slight. In all the other cases I tried, the first solution is much faster regardless of the number of bins and the sizes of the first two dimensions. The only important line, ((bins*np.arange(i, i+block_size)[:,None] * keep)[keep].reshape(indices.shape) + indices).reshape(-1), might be more optimizable, though I have not yet found a faster method.
This also means that the sheer number of O(n) vectorized operations is outdoing the O(n log n) of the sort and the repeated incremental searches.
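As a side note on that line: the multiply-by-keep-and-mask trick can also be written with np.nonzero. A small equivalence check (a sketch only, not benchmarked; the shapes mirror the code above):

import numpy as np

# Both expressions build the same flattened, offset bin indices for a block
# starting at global row `i`.
rng = np.random.default_rng(0)
bins, i = 7, 3
keep = rng.random((4, 5)) < 0.8                      # in-range mask
idx = rng.integers(0, bins, size=int(keep.sum()))    # per-element bin indices

a = ((bins * np.arange(i, i + keep.shape[0])[:, None] * keep)[keep]
     .reshape(idx.shape) + idx).reshape(-1)
b = bins * (i + np.nonzero(keep)[0]) + idx           # same order: C-order masking
print(np.array_equal(a, b))                          # True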
However, a realistic use case will have a last axis with lots of data and the earlier axes with few. So in reality the samples in the first solution are too contrived to fit the desired performance.
Axis addition for histogram is tracked as an issue in the numpy repository: https://github.com/numpy/numpy/issues/13166
The xhistogram library also tries to solve this problem: https://xhistogram.readthedocs.io/en/latest/