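I have many points inside a square. I want to partition the square into many small rectangles and count how many points fall into each rectangle, i.e. I want to compute the joint probability distribution of the points. Below are a couple of common-sense approaches, which use loops and are not very efficient: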
% Data
N = 1e5; % number of points
xy = rand(N, 2); % coordinates of points
xy(randi(2*N, 100, 1)) = 0; % add some points on one side
xy(randi(2*N, 100, 1)) = 1; % add some points on the other side
xy(randi(N, 100, 1), :) = 0; % add some points on one corner
xy(randi(N, 100, 1), :) = 1; % add some points on one corner
inds= unique(randi(N, 100, 1)); xy(inds, :) = repmat([0 1], numel(inds), 1); % add some points on one corner
inds= unique(randi(N, 100, 1)); xy(inds, :) = repmat([1 0], numel(inds), 1); % add some points on one corner
% Intervals for rectangles
K1 = ceil(sqrt(N/5)); % number of intervals along x
K2 = K1; % number of intervals along y
int_x = [0:(1 / K1):1, 1+eps]; % intervals along x
int_y = [0:(1 / K2):1, 1+eps]; % intervals along y
% First approach
tic
count_cells = zeros(K1 + 1, K2 + 1);
for k1 = 1:K1+1
inds1 = (xy(:, 1) >= int_x(k1)) & (xy(:, 1) < int_x(k1 + 1));
for k2 = 1:K2+1
inds2 = (xy(:, 2) >= int_y(k2)) & (xy(:, 2) < int_y(k2 + 1));
count_cells(k1, k2) = sum(inds1 .* inds2);
end
end
toc
% Elapsed time is 46.090677 seconds.
% Second approach
tic
count_again = zeros(K1 + 2, K2 + 2);
for k1 = 1:K1+1
inds1 = (xy(:, 1) >= int_x(k1));
for k2 = 1:K2+1
inds2 = (xy(:, 2) >= int_y(k2));
count_again(k1, k2) = sum(inds1 .* inds2);
end
end
count_again_fix = diff(diff(count_again')');
toc
% Elapsed time is 22.903767 seconds.
% Check: the two solutions are equivalent
all(count_cells(:) == count_again_fix(:))
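How can I do this more efficiently in terms of time and memory, and possibly without loops?
EDIT -> I just found this, which is the best solution I have found so far: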
tic
count_cells_hist = hist3(xy, 'Edges', {int_x int_y});
count_cells_hist(end, :) = []; count_cells_hist(:, end) = [];
toc
all(count_cells(:) == count_cells_hist(:))
% Elapsed time is 0.245298 seconds.
But it requires the Statistics Toolbox.
EDIT -> Testing the solution suggested by chappjc:
tic
xcomps = single(bsxfun(@ge,xy(:,1),int_x));
ycomps = single(bsxfun(@ge,xy(:,2),int_y));
count_again = xcomps.' * ycomps; %' 143x143 = 143x1e5 * 1e5x143
count_again_fix = diff(diff(count_again')');
toc
% Elapsed time is 0.737546 seconds.
all(count_cells(:) == count_again_fix(:))
Answer 0 (score: 2)
I wrote a simple mex function that works very well when N is large. Of course this is cheating, but still...
The function is:
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
mwSize hh, ctrl; /* counters */
mwSize N, m, n; /* size of matrices */
uint32_T *xy; /* data: zero-based bin indices, passed in as uint32 */
uint32_T *count_cells; /* joint frequencies, stored as uint32 */
/* matrices needed */
mxArray *count_cellsArray;
/* Now we need to get the data; the element type must match the uint32 input */
if (nrhs != 3 || !mxIsUint32(prhs[0])) {
mexErrMsgTxt("Usage: joint_dist_points_2D(uint32 xy, m, n)");
}
xy = (uint32_T*) mxGetData(prhs[0]);
N = (mwSize) mxGetM(prhs[0]);
m = (mwSize) mxGetScalar(prhs[1]);
n = (mwSize) mxGetScalar(prhs[2]);
/* Then build the matrices for the output */
count_cellsArray = mxCreateNumericMatrix(m + 1, n + 1, mxUINT32_CLASS, mxREAL);
count_cells = mxGetData(count_cellsArray);
plhs[0] = count_cellsArray;
hh = 0; /* counter for elements of xy */
/* for all points from 1 to N */
for(hh=0; hh<N; hh++) {
ctrl = (m + 1) * xy[N + hh] + xy[hh];
count_cells[ctrl] = count_cells[ctrl] + 1;
}
}
It can be saved in a file "joint_dist_points_2D.c" and then compiled with:
mex joint_dist_points_2D.c
And to test it:
% Data
N = 1e7; % number of points
xy = rand(N, 2); % coordinates of points
xy(randi(2*N, 1000, 1)) = 0; % add some points on one side
xy(randi(2*N, 1000, 1)) = 1; % add some points on the other side
xy(randi(N, 1000, 1), :) = 0; % add some points on one corner
xy(randi(N, 1000, 1), :) = 1; % add some points on one corner
inds= unique(randi(N, 1000, 1)); xy(inds, :) = repmat([0 1], numel(inds), 1); % add some points on one corner
inds= unique(randi(N, 1000, 1)); xy(inds, :) = repmat([1 0], numel(inds), 1); % add some points on one corner
% Intervals for rectangles
K1 = ceil(sqrt(N/5)); % number of intervals along x
K2 = ceil(sqrt(N/7)); % number of intervals along y
int_x = [0:(1 / K1):1, 1+eps]; % intervals along x
int_y = [0:(1 / K2):1, 1+eps]; % intervals along y
% Use Statistics Toolbox: hist3
tic
count_cells_hist = hist3(xy, 'Edges', {int_x int_y});
count_cells_hist(end, :) = []; count_cells_hist(:, end) = [];
toc
% Elapsed time is 4.414768 seconds.
% Use mex function
tic
xy2 = uint32(floor(xy ./ repmat([1 / K1, 1 / K2], N, 1)));
count_cells = joint_dist_points_2D(xy2, uint32(K1), uint32(K2));
toc
% Elapsed time is 0.586855 seconds.
% Check: the two solutions are equivalent
all(count_cells_hist(:) == count_cells(:))
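The index arithmetic in the mex loop, ctrl = (m + 1) * xy[N + hh] + xy[hh], is a column-major linear index into the (m+1)-by-(n+1) output. A minimal pure-MATLAB sketch of the same bookkeeping follows, on a tiny made-up set of bin indices; the names bins, lin and counts are illustrative only and do not come from the mex file.

```matlab
% Hypothetical zero-based bin indices for 5 points: [x bin, y bin]
bins = uint32([0 0; 1 2; 1 2; 2 1; 0 0]);
m = 2; n = 2;                          % output grid is (m+1)-by-(n+1)

% Same formula as in the C loop, plus 1 for MATLAB's one-based indexing
lin = (m + 1) * double(bins(:,2)) + double(bins(:,1)) + 1;

% Count how many points share each linear index, then fold back into a grid
counts = reshape(accumarray(lin, 1, [(m+1)*(n+1), 1]), m+1, n+1);
disp(counts)
```

This is much slower than the compiled loop for large N, but it can be handy for checking the indexing on small inputs.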
Answer 1 (score: 0)
Your loops (and the nested dot product) can be eliminated with bsxfun and matrix multiplication, as follows:
xcomps = bsxfun(@ge,xy(:,1),int_x);
ycomps = bsxfun(@ge,xy(:,2),int_y);
count_again = double(xcomps).'*double(ycomps); %' 143x143 = 143x1e5 * 1e5x143
count_again_fix = diff(diff(count_again')');
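The multiplication step performs the AND and the summation done in sum(inds1 .* inds2), but without looping over the density matrix. EDIT: If you use single instead of double, execution time is nearly halved, but be sure to convert your answer to double or whatever type the rest of your code requires. On my computer this takes about 0.5 seconds.
Note: with rot90(count_again/size(xy,1),2) you have a CDF, and in rot90(count_again_fix/size(xy,1),2) you have a PDF.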
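To see why multiplying the two comparison tables and then differencing twice yields per-cell counts, here is a small sketch on six made-up points with two bins per axis. The variable names (pts, edges, S, cells) are illustrative and not part of the code above.

```matlab
% Toy data: 6 points in the unit square (hypothetical values)
pts   = [0.1 0.2; 0.2 0.3; 0.6 0.7; 0.8 0.1; 0.9 0.2; 0.7 0.3];
edges = [0 0.5 1+eps];                 % same edge layout as int_x / int_y

% Logical "coordinate >= edge" tables, one column per edge
xc = bsxfun(@ge, pts(:,1), edges);     % 6x3
yc = bsxfun(@ge, pts(:,2), edges);     % 6x3

% S(a,b) = number of points with x >= edges(a) AND y >= edges(b)
S = double(xc).' * double(yc);         % 3x3

% Differencing along both dimensions turns these counts into per-cell counts;
% the two sign flips cancel because diff is applied twice
cells = diff(diff(S.').');             % 2x2, rows index x bins, columns y bins
disp(cells)
```

Row i, column j of cells then holds the number of points in x bin i and y bin j.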
Another approach is to use accumarray to build the joint histogram after binning the data.
Starting with int_x, int_y, K1, xy, etc.:
% map (0,1) data onto [1 K1], following A. Donda's approach for easy comparison
ii = floor(xy(:,1)*(K1-eps))+1; ii(ii<1) = 1; ii(ii>K1) = K1;
jj = floor(xy(:,2)*(K1-eps))+1; jj(jj<1) = 1; jj(jj>K1) = K1;
% create the histogram and normalize
H = accumarray([ii jj], 1, [K1 K1]); % fix the size so empty edge bins are kept
PDF = H / size(xy,1); % for probabilities summing to 1
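On my computer this takes about 0.01 seconds.
The output is the same as A. Donda's after converting from sparse to full (full(H)). Although, as A. Donda pointed out, it is correct for the dimensions to be K1 x K1, rather than the K1+1 x K1+1 size of count_again_fix in the OP's code.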
To get the CDF, I believe you can simply apply cumsum to each axis of the PDF.
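A minimal sketch of accumulating a joint PDF along both axes, using a small made-up 2-by-3 PDF rather than the real data; the values are hypothetical and chosen only so that they sum to 1.

```matlab
% Hypothetical joint PDF on a 2x3 grid (entries sum to 1)
PDF = [0.1 0.2 0.1;
       0.3 0.2 0.1];

% Accumulate down the rows (dimension 1), then across the columns (dimension 2)
CDF = cumsum(cumsum(PDF, 1), 2);

% CDF(i,j) is the total probability of all cells with row <= i and column <= j,
% so the last entry equals 1 up to rounding
disp(CDF)
```

The order of the two cumsum calls does not matter, since each one sums along a different dimension.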
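Answer 2 (score: 0)
chappjc's answer and using hist3 are both fine, but since I happened to need something like this a while ago and for some reason could not find hist3, I wrote it myself, and I thought I would post it here as a bonus. It uses sparse to do the actual counting and returns the result as a sparse matrix, so it may be useful for multimodal distributions whose modes are far apart, or for someone who does not have the Statistics Toolbox.
Applied to francesco's data: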
K1 = ceil(sqrt(N/5));
[H, xs, ys] = hist2d(xy(:, 1), xy(:, 2), [K1 K1], [0, 1 + eps, 0, 1 + eps]);
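Calling the function with output arguments just returns the result, without generating a color plot.
Here is the function:
function [H, xs, ys] = hist2d(x, y, n, ax)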
% plot 2d-histogram as an image
%
% hist2d(x, y, n, ax)
% [H, xs, ys] = hist2d(x, y, n, ax)
%
% x: data for horizontal axis
% y: data for vertical axis
% n: how many bins to use for each axis, default is [100 100]
% ax: axis limits for the plot, default is [min(x), max(x), min(y), max(y)]
% H: 2d-histogram as a sparse matrix, indices 1 & 2 correspond to x & y
% xs: corresponding vector of x-values
% ys: corresponding vector of y-values
%
% x and y have to be column vectors of the same size. Data points
% outside of the axis limits are allocated to the first or last bin,
% respectively. If output arguments are given, no plot is generated;
% it can be reproduced by "imagesc(ys, xs, H'); axis xy".
% defaults
if nargin < 3
n = [100 100];
end
if nargin < 4
ax = [min(x), max(x), min(y), max(y)];
end
% parameters
nx = n(1);
ny = n(2);
xl = ax(1 : 2);
yl = ax(3 : 4);
% generate histogram
i = floor((x - xl(1)) / diff(xl) * nx) + 1;
i(i < 1) = 1;
i(i > nx) = nx;
j = floor((y - yl(1)) / diff(yl) * ny) + 1;
j(j < 1) = 1;
j(j > ny) = ny;
H = sparse(i, j, ones(size(i)), nx, ny);
% generate axes
xs = (0.5 : nx) / nx * diff(xl) + xl(1);
ys = (0.5 : ny) / ny * diff(yl) + yl(1);
% possibly plot
if nargout == 0
imagesc(ys, xs, H')
axis xy
clear H xs ys
end
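The counting in hist2d relies on sparse adding up the values of repeated (i, j) pairs. A small sketch of that behaviour on made-up bin indices, independent of the function above:

```matlab
% Hypothetical one-based bin indices for 4 points on a 3x2 grid
i = [1; 1; 3; 1];
j = [2; 2; 1; 2];

% Repeated (i, j) pairs are summed, so each entry ends up as a point count
H = sparse(i, j, ones(size(i)), 3, 2);

disp(full(H))    % dense view of the counts
disp(nnz(H))     % only non-empty bins are stored
```

Because only non-empty bins take up storage, this layout suits histograms in which most cells are empty.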