Histograms with many dimensions in Python

Date: 2020-07-14 15:35:20

Tags: python numpy bigdata histogram

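I am running simulations of a stochastic many-body system, and I currently need to obtain a multidimensional probability distribution from the generated data. For this I tried np.histogramdd, as follows: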

bins = np.linspace(start = -x_max, stop = x_max, num = n_bins)
hists = np.histogramdd(Data, bins = [bins] * dimensions, density = True)

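However, this code already produces a MemoryError (or raises an exception about some array being too large) for n_bins = 20, dimensions = 5 and np.shape(Data) = (1000, 5), which is far below the target values. The number of bins grows exponentially with the number of dimensions, so it is easy to see why such problems arise. The question therefore is: how can one generate, store and work with high-dimensional histograms in Python? Is there an existing framework for this? Would it be better to switch to something else?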
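To make the exponential growth concrete, here is a rough sketch of the memory a dense histogram needs. The helper name dense_hist_bytes is made up for this illustration, and the figures are a lower bound: they ignore any temporary arrays that np.histogramdd allocates internally.

```python
import numpy as np

def dense_hist_bytes(bins_per_axis: int, dimensions: int, dtype=np.float64) -> int:
    """Bytes needed for a dense histogram with bins_per_axis**dimensions cells."""
    return bins_per_axis ** dimensions * np.dtype(dtype).itemsize

# np.linspace(..., num=20) yields 20 edges, i.e. 19 bins per axis.
for d in (3, 7, 10, 15):
    print(f"dimensions={d:2d}: {dense_hist_bytes(19, d) / 1e9:.3g} GB")
```

Each extra dimension multiplies the requirement by the number of bins per axis, so a smaller dtype only buys one or two more dimensions.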

Edit: MCVE and sample error output.

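import numpy as np

x_max = 10
n_bins = 20
dimensions = 7  # 10 and 15 fail differently, see the tracebacks below
Data = np.random.uniform(-x_max, x_max, size=(1000, dimensions))

# n_bins edges per axis, i.e. n_bins - 1 bins per axis
bins = np.linspace(start = -x_max, stop = x_max, num = n_bins)
hists = np.histogramdd(Data, bins = [bins] * dimensions, density = True)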

With dimensions = 7, I get:

lib\site-packages\numpy\lib\histograms.py in histogramdd(sample, bins, range, normed, weights, density)
   1066     # Compute the number of repetitions in xy and assign it to the
   1067     # flattened histmat.
-> 1068     hist = np.bincount(xy, weights, minlength=nbin.prod())

MemoryError:

With dimensions = 15:

   1062     # Compute the sample indices in the flattened histogram matrix.
   1063     # This raises an error if the array is too large.
-> 1064     xy = np.ravel_multi_index(Ncount, nbin)
   1065 
   1066     # Compute the number of repetitions in xy and assign it to the

ValueError: invalid dims: array size defined by dims is larger than the maximum possible size. 

With dimensions = 10:

   1066     # Compute the number of repetitions in xy and assign it to the
   1067     # flattened histmat.
-> 1068     hist = np.bincount(xy, weights, minlength=nbin.prod())
   1069 
   1070     # Shape into a proper matrix

ValueError: 'minlength' must not be negative
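The three different failures are consistent with plain integer arithmetic on the total cell count. The sketch below is one possible reading, not something stated in the question: it assumes np.histogramdd allocates 21 cells per axis (19 bins from 20 edges, plus two outlier cells) and that the product was computed in 32-bit integers, the default integer of NumPy 1.x on Windows. wrap_int32 is a hypothetical helper written for this illustration.

```python
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

def wrap_int32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value

# Assumption: 20 edges -> 19 bins, plus 2 outlier cells = 21 cells per axis.
cells_per_axis = 21
for d in (7, 10, 15):
    total = cells_per_axis ** d
    print(
        f"d={d:2d} cells={total:.3e} "
        f"fits_int32={total <= INT32_MAX} "
        f"int32_wrapped_negative={wrap_int32(total) < 0} "
        f"fits_int64={total <= INT64_MAX}"
    )
```

Under these assumptions, 7 dimensions still fits in 32 bits but needs gigabytes (MemoryError), 10 dimensions wraps to a negative 32-bit value (the 'minlength' error), and 15 dimensions exceeds even 64 bits (the "larger than the maximum possible size" error).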

1 answer:

Answer 0 (score: 0)

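If the histogram has a fixed bin width along each axis, you can do your own bookkeeping and use a low-memory data type for the counts (for example, 1 byte per bin). In the example below the bins are the same for every axis, but you can adapt it to bin ranges that differ between axes, as long as the bin edges are equidistant.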

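This code does no range checking; you need to make sure that the histogram bins cover the full range of the data. Otherwise an out-of-range sample either raises an IndexError or, if its index is negative, is silently counted in the wrong bin.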
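For illustration, a range check combined with per-axis bin layouts could look like the sketch below. It is not part of the answer: the per-axis arrays x0, xstep and n_bins and the mask in_range are names chosen for this example, and dropping out-of-range samples is only one possible policy (clipping them into the edge bins is another).

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim = 3
data = rng.normal(scale=5.0, size=(10_000, n_dim))  # some samples fall outside

# Hypothetical per-axis layout: lower edge, bin width and bin count per axis.
x0 = np.array([-10.0, -5.0, -8.0])
xstep = np.array([1.0, 0.5, 2.0])
n_bins = np.array([20, 20, 8])

# Use floor (not truncation) so values just below x0 map to -1 instead of 0.
ii = np.floor((data - x0) / xstep).astype(np.int64)

# Keep only samples whose bin index is valid on every axis.
in_range = np.all((ii >= 0) & (ii < n_bins), axis=1)
ii = ii[in_range]

hist = np.zeros(tuple(n_bins), dtype=np.int32)
np.add.at(hist, tuple(ii.T), 1)

print(f"kept {in_range.sum()} of {len(data)} samples")
```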

import numpy as np

x_max = 10 
n_dim = 7
n_data = 100000
data = np.random.uniform(-x_max, x_max-0.01, size=(n_data, n_dim))

# assume bins are the same for all dimensions. Bin edges at x0+i*xstep.
n_bins = 5
x0 = -x_max
xstep = 2*x_max/n_bins

# high-dimensional histogram
hist = np.zeros((n_bins,)*n_dim, dtype=np.int8)

# build the histogram indices corresponding to the data samples.
ii = ((data - x0)*(1/xstep)).astype(np.int16) # shape (n_data, n_dim)

# increment the histogram bins. The np.add.at will correctly handle 
# bins that occur multiple times in the input.
np.add.at(hist, tuple(ii.T), 1)

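However, with n_dim = 8 or 9 and the question's n_bins = 20, this still runs out of memory on most systems: 20**8 one-byte bins already take about 25.6 GB.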

The question is: what are you going to do with a histogram that has 10**10 bins? Do you have 10**11 samples or more?
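When there are far fewer samples than bins, almost every bin is empty, so one alternative is to store only the occupied bins. The sketch below is a substitute technique, not the answer's method: it counts distinct rows of the index array with np.unique, and the sizes are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_max, n_bins, n_dim, n_data = 10, 20, 15, 1000
data = rng.uniform(-x_max, x_max, size=(n_data, n_dim))

xstep = 2 * x_max / n_bins
ii = np.floor((data + x_max) / xstep).astype(np.int16)
ii = np.clip(ii, 0, n_bins - 1)  # guard against rounding at the upper edge

# Each distinct row of ii is one occupied bin; counts[k] is its population.
occupied, counts = np.unique(ii, axis=0, return_counts=True)

print(occupied.shape, counts.sum())
```

Memory then scales with the number of samples rather than with n_bins**n_dim, at the cost of losing direct array indexing into the histogram.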

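It is more practical to keep the ii array and generate lower-dimensional histograms when needed. For example, to reduce the 7D histogram to a 4D histogram over axes 0, 1, 5, 6: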

hist_4d = np.zeros((n_bins,)*4, dtype=np.int16)
np.add.at(hist_4d, tuple(ii[:, [0, 1, 5, 6]].T), 1)
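As a sanity check, a histogram built directly from a subset of axes should equal the full histogram summed over the dropped axes. The sketch below verifies this on a small case; it uses random stand-in indices in place of the answer's ii, and int32 counts so that nothing overflows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_dim, n_data = 5, 7, 10_000
ii = rng.integers(0, n_bins, size=(n_data, n_dim))  # stand-in bin indices

# Full 7D histogram (int32 so counts cannot overflow here).
hist = np.zeros((n_bins,) * n_dim, dtype=np.int32)
np.add.at(hist, tuple(ii.T), 1)

# 4D histogram built directly from the kept axes 0, 1, 5, 6.
keep = [0, 1, 5, 6]
hist_4d = np.zeros((n_bins,) * len(keep), dtype=np.int32)
np.add.at(hist_4d, tuple(ii[:, keep].T), 1)

# Summing the full histogram over the dropped axes gives the same counts.
marginal = hist.sum(axis=(2, 3, 4))
print(np.array_equal(marginal, hist_4d))
```

The direct route never needs the full 7D array, which is the point of keeping ii around.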

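Note: signed integers are recommended for the bin counts. Integer overflow is silent, but a negative count in a bin at least indicates that an overflow has occurred.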
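A small demonstration of that silent overflow, using a two-bin int8 histogram chosen only for illustration:

```python
import numpy as np

counts = np.zeros(2, dtype=np.int8)                 # int8 holds -128..127
np.add.at(counts, np.zeros(200, dtype=np.intp), 1)  # 200 hits on bin 0

print(counts[0])           # wrapped around; no exception was raised
print((counts < 0).any())  # a negative count flags the overflow
```

The check is not foolproof: a bin that wraps all the way past zero looks valid again, so the safest choice is a dtype whose maximum is at least the number of samples.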