Question

P是一个n * d矩阵，包含n个d维样本。某些地区P的密度是其他地区的几倍。我想选择P的一个子集，其中任何样本对之间的距离大于d0，我需要它遍布整个区域。所有样本都具有相同的优先级，并且不需要优化任何内容（例如覆盖面积或成对距离之和）。

这是一个示例代码，但它确实很慢。我需要一个更有效的代码，因为我需要多次调用它。

%% generating sample data
n_4 = 1000; n_2 = n_4*2;n = n_4*4;
x1=[ randn(n_4, 1)*10+30; randn(n_4, 1)*3 + 60];
y1=[ randn(n_4, 1)*5 + 35; randn(n_4, 1)*20 + 80 ];
x2 = rand(n_2, 1)*(max(x1)-min(x1)) + min(x1);
y2 = rand(n_2, 1)*(max(y1)-min(y1)) + min(y1);
P = [x1,y1;x2, y2];
%% eliminating close ones
tic
d0 = 1.5;
D = pdist2(P, P);D(1:n+1:end) = inf;
E = zeros(n, 1); % eliminated ones
for i=1:n-1
    if ~E(i)
        CloseOnes = (D(i,:)<d0) & ((1:n)>i) & (~E');
        E(CloseOnes) = 1;
    end
end
P2 = P(~E, :);
toc
%% plotting samples
subplot(121); scatter(P(:, 1), P(:, 2)); axis equal;
subplot(122); scatter(P2(:, 1), P2(:, 2)); axis equal;

编辑：子集应该有多大？

正如j_random_hacker在评论中指出的那样，如果我们没有对所选样本的数量进行约束，可以说P(1, :)是最快的答案。它精巧地显示了标题的不连贯性！但我认为目前的标题更好地描述了目的。因此，让我们定义一个约束：“如果可能的话，尝试选择m个样本”。现在有了m=n的隐含假设，我们可以获得最大可能的子集。正如我之前提到的，更快的方法优于找到最佳答案的方法。

Answer 1

反复查找最近的点表明针对空间搜索优化的不同数据结构。我建议进行delaunay三角测量。

以下解决方案是“近似”的，因为它可能会删除多于严格必要的点数。我正在对所有计算进行批处理，并在每次迭代中删除所有点，这些点会导致距离太长，并且在许多情况下，删除一个点可能会删除在同一次迭代中稍后出现的边。如果这很重要，可以进一步处理边缘列表以避免重复，甚至找到要移除的点，这将影响最大距离。

这很快。

dt = delaunayTriangulation(P(:,1), P(:,2));
d0 = 1.5;

while 1
    edge = edges(dt);  % vertex ids in pairs

    % Lookup the actual locations of each point and reorganize
    pwise = reshape(dt.Points(edge.', :), 2, size(edge,1), 2);
    % Compute length of each edge
    difference = pwise(1,:,:) - pwise(2,:,:);
    edge_lengths = sqrt(difference(1,:,1).^2 + difference(1,:,2).^2);

    % Find edges less than minimum length
    idx = find(edge_lengths < d0);
    if(isempty(idx))
        break;
    end

    % pick first vertex of each too-short edge for deletion
    % This could be smarter to avoid overdeleting
    points_to_delete = unique(edge(idx, 1));

    % remove them.  triangulation auto-updates
    dt.Points(points_to_delete, :) = [];

    % repeat until no edge is too short
end

P2 = dt.Points;

Answer 2

您没有指定要选择的点数。这对这个问题至关重要。

我不太愿意看到优化方法的方法。

假设欧几里德距离作为距离测量是可接受的，当仅选择少量点时，以下实现要快得多，并且即使在尝试使用全部＆＃39;有效点（请注意，找到最大可能点数很难）。

%%
figure;
subplot(121); scatter(P(:, 1), P(:, 2)); axis equal;

d0 = 1.5;

m_range = linspace(1, 2000, 100);
m_time = NaN(size(m_range));

for m_i = 1:length(m_range);
    m = m_range(m_i)

    a = tic;
    % Test points in random order.
    r = randperm(n);
    r_i = 1;

    S = false(n, 1); % selected ones
    for i=1:m
        found = false;

        while ~found
            j = r(r_i);
            r_i = r_i + 1;
            if r_i > n
                % We have tried all points. Nothing else can be valid.
                break;
            end
            if sum(S) == 0
                % This is the first point.
                found = true;
            else
                % Get the points already selected
                P_selected = P(S, :);
                % Exclude points >= d0 along either axis - they cannot have
                % a Euclidean distance less than d0.
                P_valid = (abs(P_selected(:, 1) - P(j, 1)) < d0) & (abs(P_selected(:, 2) - P(j, 2)) < d0);
                if sum(P_valid) == 0
                    % There are no points that can be < d0.
                    found = true;
                else
                    % Implement Euclidean distance explicitly rather than
                    % using pdist - this makes a large difference to
                    % timing.
                    found = min(sqrt(sum((P_selected(P_valid, :) - repmat(P(j, :), sum(P_valid), 1)) .^ 2, 2))) >= d0;
                end
            end
        end
        if found
            % We found a valid point - select it.
            S(j) = true;
        else
            % Nothing found, so we must have exhausted all points.
            break;
        end
    end
    P2 = P(S, :);
    m_time(m_i) = toc(a);
    subplot(122); scatter(P2(:, 1), P2(:, 2)); axis equal;
    drawnow;
end
%%
figure
plot(m_range, m_time);
hold on;
plot(m_range([1 end]), ones(2, 1) * original_time);
hold off;

其中original_time是您的方法所花费的时间。这给出了以下时间，其中红线是您的方法，蓝色是我的，沿x轴选择的点数。请注意，当所有＆＃39;全部＆＃39;已选择符合标准的分数。

正如您在评论中所说，绩效高度依赖于d0的价值。事实上，随着d0的减少，上述方法似乎在性能方面有了更大的改进（这适用于d0=0.1）：

但请注意，这也取决于其他因素，例如数据的分布。此方法利用数据集的特定属性，并通过过滤掉计算欧几里德距离无意义的点来减少昂贵的计算次数。这对于选择较少的点尤其有效，并且对于较小的d0实际上更快，因为数据集中与标准匹配的点较少（因此所需的欧几里德距离的计算较少）。这类问题的最佳解决方案通常特定于所使用的确切数据集。

另请注意，在上面的代码中，手动计算欧几里德距离要比调用pdist快得多。在简单的情况下，Matlab内置函数的灵活性和通用性通常不利于性能。

如何选择部分密集数据集的均匀分布子集？

2 个答案: