Question

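I have two data sets whose parameter spaces overlap:

I want to bin the red set and find the standard deviation of the values in each bin. Then, for each point in the blue set, I want to find the red bin that the point falls in and fetch the standard deviation computed for that bin.

So far I have been using scipy.stats.binned_statistic_2d, but I'm not sure where to go from here: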

import scipy.stats
import numpy as np
# given numpy recarrays red_set and blue_set with columns x,y,values

nbins = 50

red_bins = scipy.stats.binned_statistic_2d(red_set['x'],
                                           red_set['y'],
                                           red_set['values'],
                                           statistic = np.std,
                                           bins = nbins)

blue_bins = scipy.stats.binned_statistic_2d(blue_set['x'],
                                            blue_set['y'],
                                            blue_set['values'],
                                            statistic = 'count',
                                            bins = [red_bins[1], red_bins[2]])

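Now I don't know how to get the value of the corresponding red bin for each blue point. I know that the last return value of scipy.stats.binned_statistic_2d is the binnumber of each input data point, but I don't know how to turn that into the actual computed statistic (the standard deviation in this case).

I know the blue set is binned in exactly the same way as the red (a quick plot confirms this). It seems like grabbing the corresponding red bin should be completely straightforward, but I can't figure it out.

Let me know if I can make my question clearer.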

Answer 1

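You need to make sure that you specify the same range when binning both data sets. That way, the corresponding bin indices will be consistent. I used the lower-level numpy function histogram2d; this could be extended to the standard deviation in the same way using scipy.stats.binned_statistic_2d: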

import numpy as np
import matplotlib.pyplot as plt

#Setup random data
red = np.random.randn(100,2)
blue = np.random.randn(100,2)

#plot
plt.plot(red[:,0],red[:,1],'r.')
plt.plot(blue[:,0],blue[:,1],'b.')

#Specify limits of binned data
xmin = -3.; xmax = 3.
ymin = -3.; ymax = 3.

#Bin data using np.histogram2d
rbins, xrb, yrb = np.histogram2d(red[:,0],red[:,1],bins=10,range=[[xmin,xmax],[ymin,ymax]])
bbins, xbb, ybb = np.histogram2d(blue[:,0],blue[:,1],bins=10,range=[[xmin,xmax],[ymin,ymax]])

#Check that bins correspond to the same positions in space
assert all(xrb == xbb)
assert all(yrb == ybb)

#Obtain centers of the bins and plot the difference
xc = xrb[:-1] + 0.5 * (xrb[1:] - xrb[:-1])
yc = yrb[:-1] + 0.5 * (yrb[1:] - yrb[:-1])
#histogram2d indexes its result as [x, y]; contourf expects [y, x], hence the transpose
plt.contourf(xc, yc, (rbins-bbins).T, alpha=0.4)
plt.colorbar()
plt.show()
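
The same idea might be carried over to the per-bin standard deviation roughly as follows. This is only a sketch under stated assumptions: the random arrays are hypothetical stand-ins for red_set and blue_set, and the names limits, inside and blue_std are made up for illustration. It bins the blue points on the red edges with expand_binnumbers=True, which returns a separate x and y bin index for every point, and then uses those indices to look up the red statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Hypothetical stand-ins for red_set and blue_set (x, y, values columns)
red_x, red_y, red_v = rng.normal(size=(3, 2000))
blue_x, blue_y = rng.normal(size=(2, 500))

nbins = 10
limits = [[-3.0, 3.0], [-3.0, 3.0]]  # assumed common range for both sets

# Standard deviation of the red values in each bin, shape (nbins, nbins)
red_std, xedges, yedges, _ = stats.binned_statistic_2d(
    red_x, red_y, red_v, statistic=np.std, bins=nbins, range=limits)

# Bin the blue points on the red edges; values are not needed for 'count'.
# expand_binnumbers=True gives one row of x indices and one row of y indices.
blue = stats.binned_statistic_2d(
    blue_x, blue_y, None, statistic='count',
    bins=[xedges, yedges], expand_binnumbers=True)
ix, iy = blue.binnumber

# Bin numbers are 1-based; 0 and nbins + 1 mark points outside the edges
inside = (ix >= 1) & (ix <= nbins) & (iy >= 1) & (iy <= nbins)

# Red standard deviation for each blue point; NaN where the point lies
# outside the binned range (or where the red bin holds no points)
blue_std = np.full(blue_x.shape, np.nan)
blue_std[inside] = red_std[ix[inside] - 1, iy[inside] - 1]
```

Subtracting 1 converts the 1-based bin numbers into indices of the statistic array. Blue points that land in a red bin with no red points get NaN, so they may need separate handling.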

Finding corresponding bins between two data sets
