Question

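I have a numpy array of points in the XY plane, like this: [image: distribution]

I want to select n points (say 100) from all of these points so that they are better distributed. That is, I want the density of points to be constant everywhere.

Something like this:

[image]

Is there a pythonic way, or a numpy/scipy function, to do this?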

Answer 1

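@EMS is quite right that you should think carefully about exactly what you want.

There are more sophisticated ways to do this (EMS's suggestions are very good!), but a brute-force approach is to bin the points onto a regular, rectangular grid and draw one random point from each bin.

The major downside is that you won't get the number of points you asked for. Instead, you'll get some number smaller than that.

A bit of creative indexing with pandas makes this "gridding" approach quite easy, though you can certainly do it with "pure" numpy as well.

As an example of the simplest possible brute-force grid approach (there is a lot we could do better here):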

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))

# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000

# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)

# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])

# Group by which cell the points fall into and choose a random point from each
groups = df.groupby(df.index)
new = groups.agg(lambda x: np.random.permutation(x)[0])

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
# 'box-forced' was removed from matplotlib; 'box' now works with shared axes.
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()

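For reference, the same gridding idea can be sketched in "pure" numpy. This is only an illustration under stated assumptions: the helper name `grid_subsample`, the clipping of the outermost bin, and the use of `np.random.default_rng` are illustrative choices and not part of the code above.

```python
import numpy as np


def grid_subsample(x, y, subset_num, rng=None):
    """Return indices of one randomly chosen point per occupied grid cell.

    Hypothetical helper: bins the points onto a square grid with roughly
    `subset_num` cells and keeps one random point from every non-empty cell.
    """
    rng = np.random.default_rng() if rng is None else rng
    nbins = int(np.sqrt(subset_num))
    xbins = np.linspace(x.min(), x.max(), nbins + 1)
    ybins = np.linspace(y.min(), y.max(), nbins + 1)

    # Cell index along each axis; clip so the maximum lands in the last cell.
    i = np.clip(np.digitize(y, ybins) - 1, 0, nbins - 1)
    j = np.clip(np.digitize(x, xbins) - 1, 0, nbins - 1)
    cell = i * nbins + j  # flatten (row, column) into a single cell id

    # Shuffle the points, then keep the first occurrence of every cell id.
    order = rng.permutation(x.size)
    _, first = np.unique(cell[order], return_index=True)
    return order[first]


rng = np.random.default_rng(0)
x, y = rng.normal(0, 1, (2, 100_000))
idx = grid_subsample(x, y, 1000, rng)
print(len(idx), "points kept")
```

As with the pandas version, the number of points returned is at most `nbins**2` and usually smaller, because empty cells contribute nothing.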

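Based on @EMS's suggestion in the comments, here is another approach.

We'll estimate the density of points with a kernel density estimate, and then use its inverse as the probability that a given point is chosen.

scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general). It is the bottleneck here. A more optimized version for this specific use case could be written in several ways (approximations, exploiting the special case of pairwise distances, etc.). However, that is beyond the scope of this particular question. Just be aware that this specific example, with 1e5 points, will take a minute or two to run.

The advantage of this method is that you get exactly the number of points you asked for. The disadvantage is that you are likely to get local clusters of selected points.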

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))

# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)

# Try playing around with this weight. Compare 1/dens,  1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()

# Draw a sample using np.random.choice with the specified probabilities.
# We'll need to view things as a structured array because np.random.choice
# expects a 1D array.
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, p=weight)

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='none')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
# 'box-forced' was removed from matplotlib; 'box' now works with shared axes.
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()

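If gaussian_kde is too slow, one possible approximation is to estimate the density with a 2D histogram and weight each point by the inverse count of its cell. Note that this is a swapped-in technique, not what the code above does. The bin count of 50 and the decision to sample without replacement are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
total_num, subset_num, nbins = 100_000, 1000, 50
x, y = rng.normal(0, 1, (2, total_num))

# Approximate the density by counting points in each cell of a regular grid.
counts, xedges, yedges = np.histogram2d(x, y, bins=nbins)

# Look up the cell of every point; clip so the maxima fall in the last cell.
ix = np.clip(np.digitize(x, xedges) - 1, 0, nbins - 1)
iy = np.clip(np.digitize(y, yedges) - 1, 0, nbins - 1)
dens = counts[ix, iy]  # at least 1: every point is counted in its own cell

# Inverse-density weights, normalized to sum to 1.
weight = 1.0 / dens
weight /= weight.sum()

# Sample indices without replacement so that no point is picked twice.
idx = rng.choice(total_num, size=subset_num, replace=False, p=weight)
subset_x, subset_y = x[idx], y[idx]
```

The histogram is much coarser than a kernel density estimate, so the result depends on the bin count in the same way the grid approach does.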

Answer 2

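Unless you give a specific criterion for what "better distributed" means, we can't give a definite answer.

The phrase "constant density at any point" is also misleading, because you have to specify the empirical method used to calculate density. Are you approximating it on a grid? If so, the grid size will matter, and points near the boundary won't be represented correctly.

A different approach might be as follows:

1. Calculate the distance matrix between all pairs of points.
2. Treating this distance matrix as a weighted network, calculate some measure of centrality for each point in the data, e.g. eigenvalue centrality, Betweenness centrality or Bonacich centrality.
3. Sort the points in descending order of the centrality measure and keep the first 100.
4. Repeat steps 1-3, possibly using a different notion of "distance" between points and different centrality measures.

Many of these functions are provided directly by SciPy, NetworkX, and scikits.learn, and will work directly on a NumPy array.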
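A minimal sketch of the first three steps, assuming Euclidean distance and eigenvector centrality computed from a dense eigendecomposition. The data is a small synthetic sample of 500 points, because the distance matrix grows quadratically with the number of points. Whether the top-ranked points count as "well distributed" depends on the centrality measure: with raw distances as edge weights, points far from the bulk of the data tend to score highest.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
points = rng.normal(0, 1, (500, 2))  # small sample: the matrix is m x m
keep = 100

# Distance matrix between all pairs of points.
dist = squareform(pdist(points))

# Eigenvector centrality of the weighted network: the eigenvector that
# belongs to the largest eigenvalue of the symmetric distance matrix.
# np.linalg.eigh returns eigenvalues in ascending order, so it is the last.
eigvals, eigvecs = np.linalg.eigh(dist)
centrality = np.abs(eigvecs[:, -1])

# Sort in descending order of centrality and keep the first `keep` points.
order = np.argsort(centrality)[::-1]
selected = points[order[:keep]]
```

Swapping in another distance (the `metric` argument of `pdist`) or another centrality measure only changes the two middle steps.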

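If you are firmly committed to thinking about the problem in terms of regular spacing and grid density, you might take a look at quasi-Monte Carlo methods. In particular, you could compute the convex hull of the set of points and then apply a QMC technique to sample regularly from anywhere within that convex hull. But again, this privileges the exterior of the region, which should be sampled far less than the interior.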
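One way this could look, as a sketch only: draw a Halton sequence over the bounding box of the data and keep the candidates that fall inside the convex hull. It assumes SciPy 1.7 or later (for `scipy.stats.qmc`). The choice of the Halton sequence, the 400 candidates, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.stats import qmc

rng = np.random.default_rng(0)
points = rng.normal(0, 1, (5000, 2))

# Low-discrepancy (Halton) samples in the unit square, scaled to the
# bounding box of the data. scramble=False keeps the sequence deterministic.
sampler = qmc.Halton(d=2, scramble=False)
candidates = qmc.scale(sampler.random(400), points.min(axis=0), points.max(axis=0))

# A candidate lies inside the convex hull exactly when it falls in some
# simplex of the Delaunay triangulation of the data.
inside = Delaunay(points).find_simplex(candidates) >= 0
samples = candidates[inside]
```

The resulting `samples` are evenly spread locations inside the hull, not points from the original data, and how many survive depends on how much of the bounding box the hull covers.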

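Yet another interesting approach would be to simply run the K-means algorithm on the scattered data, with a fixed number of clusters K = 100. After the algorithm converges, you'll have 100 points from your space (the mean of each cluster). You could repeat this several times with different random starting points for the cluster means, and then sample from that larger set of possible means. Since your data do not appear to cluster naturally into 100 components, this approach won't converge very well and may need a large number of iterations. It also has the downside that the resulting 100 points are not necessarily points from the observed data; instead, they are local averages of many points.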
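A minimal sketch of the K-means idea using `scipy.cluster.vq.kmeans2`, with synthetic data standing in for the real points. The final step, which replaces each cluster mean with the nearest observed point, is an addition not described in the answer; it is one possible way to work around the last downside, and two means could in principle map to the same observed point.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.normal(0, 1, (5000, 2))

# Run K-means with K = 100; `centroids` holds the 100 cluster means and
# `labels` the cluster index assigned to every input point.
centroids, labels = kmeans2(points, 100, minit='++')

# Optional extra step: replace every mean by the nearest observed point.
_, nearest = cKDTree(points).query(centroids)
representatives = points[nearest]
```

Running `kmeans2` several times gives different sets of means, since the initial centers are chosen randomly.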

Answer 3

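This method, which iteratively picks the remaining point with the largest minimum distance to the points already picked, has terrible time complexity but produces very evenly distributed results: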

from numpy import array, argmax, ndarray, vstack
from numpy.random import normal, randint
from scipy.spatial.distance import cdist


def well_spaced_points(points: ndarray, num_points: int):
    """
    Pick `num_points` well-spaced points from `points` array.

    :param points: An m x n array of m n-dimensional points.
    :param num_points: The number of points to pick.
    :rtype: ndarray
    :return: A num_points x n array of points from the original array.
    """
    # pick a random starting point from the whole array
    current_point_index = randint(0, points.shape[0])
    picked_points = array([points[current_point_index]])
    remaining_points = vstack((
        points[: current_point_index],
        points[current_point_index + 1:]
    ))
    # while there are more points to pick
    while picked_points.shape[0] < num_points:
        # find the remaining point that is furthest from all picked points
        distance_pk_rmn = cdist(picked_points, remaining_points)
        min_distance_pk = distance_pk_rmn.min(axis=0)
        i_furthest = argmax(min_distance_pk)
        # add it to picked points and remove it from remaining
        picked_points = vstack((
            picked_points,
            remaining_points[i_furthest]
        ))
        remaining_points = vstack((
            remaining_points[: i_furthest],
            remaining_points[i_furthest + 1:]
        ))

    return picked_points
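The cost can be reduced by keeping a running array of each point's distance to its nearest picked point, instead of recomputing the full distance matrix on every iteration. The sketch below applies the same greedy rule under that assumption. The function name `farthest_point_indices` is hypothetical, it returns indices rather than coordinates, and it assumes `num_points` does not exceed the number of distinct points.

```python
import numpy as np


def farthest_point_indices(points, num_points, rng=None):
    """Greedy farthest-point selection; returns indices into `points`."""
    rng = np.random.default_rng() if rng is None else rng
    chosen = np.empty(num_points, dtype=int)
    chosen[0] = rng.integers(len(points))  # random starting point

    # min_dist[i] = distance from point i to the nearest chosen point so far.
    min_dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for k in range(1, num_points):
        chosen[k] = np.argmax(min_dist)  # farthest from everything chosen
        new_dist = np.linalg.norm(points - points[chosen[k]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return chosen


rng = np.random.default_rng(0)
points = rng.normal(0, 1, (10_000, 2))
idx = farthest_point_indices(points, 100, rng)
subset = points[idx]
```

Each pick costs one pass over the data, so the total work is proportional to the number of points times `num_points`.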

Python: selecting n better-distributed points from a set of points
