I have a numpy array of points in the XY plane, like this: [image: distribution]



Is there a pythonic way, or any numpy/scipy function, to do this?

3 Answers:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))

# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000

# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)

# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])

# Group by which cell the points fall into and choose a random point from each
groups = df.groupby(df.index)
new = groups.agg(lambda x: np.random.permutation(x)[0])

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
# 'box-forced' was removed from matplotlib; 'box' now also works with shared axes
plt.setp(axes, aspect=1, adjustable='box')
plt.show()

Following @EMS's suggestion in the comments, here is another approach.




import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))

# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)

# Try playing around with this weight. Compare 1/dens,  1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()

# Draw a sample using np.random.choice with the specified probabilities.
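# We'll need to view things as a structured array (one record per point)
# because np.random.choice expects a 1D array. Note that it samples with
# replacement by default, so the subset can contain repeated points.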
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, p=weight)

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='none')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
# 'box-forced' was removed from matplotlib; 'box' now also works with shared axes
plt.setp(axes, aspect=1, adjustable='box')
plt.show()

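  1. Compute the distance matrix between all pairs of points.
  2. Treating this distance matrix as a weighted network, compute some centrality measure for each point in the data, e.g. eigenvalue centrality, betweenness centrality, or Bonacich centrality.
  3. Sort the points in descending order of the centrality measure and keep the first 100.
  4. Repeat steps 1-3, possibly using a different notion of "distance" between points and different centrality measures.

Many of these functions are provided directly by SciPy, NetworkX, and scikits.learn, and work directly on NumPy arrays.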
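Steps 1 to 3 above can be sketched with NumPy alone. This is only an illustration under assumptions not in the answer: the function name `top_central_points` is hypothetical, centrality is approximated by plain power iteration on the distance matrix (an eigenvector-style centrality), and the full distance matrix is assumed to fit in memory.

```python
import numpy as np

def top_central_points(points, keep=100, n_iter=200):
    """Rank points by an eigenvector-style centrality of the distance graph."""
    # Step 1: pairwise Euclidean distance matrix, shape (m, m).
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Step 2: power iteration on the distance matrix. Edge weight = distance,
    # so points that are far from many other points receive the highest score.
    score = np.full(len(points), 1.0 / len(points))
    for _ in range(n_iter):
        score = dist @ score
        score /= score.sum()

    # Step 3: sort in descending order of centrality and keep the first `keep`.
    order = np.argsort(score)[::-1][:keep]
    return points[order], score[order]

rng = np.random.default_rng(0)
pts = rng.normal(0, 1, (1000, 2))
subset, scores = top_central_points(pts, keep=100)
```

Because raw distances are used as edge weights, this particular choice favours remote points. Swapping in a similarity such as `np.exp(-dist)` would favour dense regions instead, which is the kind of variation step 4 refers to.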

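If you are really committed to thinking about the problem in terms of regular spacing and grid density, you could take a look at quasi-Monte Carlo methods. In particular, you could compute the convex hull of the point set and then apply a QMC technique to sample regularly from anywhere within that hull. But again, this privileges the outer part of the region, which should be sampled far less than the interior.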
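A rough sketch of the convex-hull idea, assuming SciPy 1.7 or later (for `scipy.stats.qmc`). The helper name `qmc_subset`, the choice of a Halton sequence, the 4x oversampling factor, and the step that snaps each sample to its nearest observed point are all assumptions made for illustration, not part of the answer.

```python
import numpy as np
from scipy.spatial import Delaunay, cKDTree
from scipy.stats import qmc

def qmc_subset(points, n_samples=100, seed=0):
    """Pick the data points nearest to Halton samples inside the convex hull."""
    # Low-discrepancy samples in the unit square, scaled to the bounding box.
    sampler = qmc.Halton(d=points.shape[1], seed=seed)
    candidates = qmc.scale(sampler.random(4 * n_samples),
                           points.min(axis=0), points.max(axis=0))

    # Keep only candidates inside the convex hull of the data. A point is in
    # the hull exactly when it lies in some simplex of the triangulation.
    inside = Delaunay(points).find_simplex(candidates) >= 0
    candidates = candidates[inside][:n_samples]

    # Snap each candidate to its nearest observed point and drop duplicates,
    # so fewer than n_samples points may come back.
    _, idx = cKDTree(points).query(candidates)
    return points[np.unique(idx)]

rng = np.random.default_rng(0)
pts = rng.normal(0, 1, (5000, 2))
subset = qmc_subset(pts, n_samples=100)
```

As the answer warns, the sparse outer part of the hull receives as many samples per unit area as the dense core, so outlying points are over-represented in the result.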

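Another interesting approach would be to simply run the K-means algorithm on the scattered data with a fixed number of clusters, K = 100. After the algorithm converges, you will have 100 points from your space (the mean of each cluster). You could repeat this several times with different random starting points for the cluster means, and then sample from that larger set of candidate means. Since your data do not appear to cluster naturally into 100 components, convergence will not be very good, and the algorithm may need to run for a large number of iterations. This also has the drawback that the resulting 100 points are not necessarily points from the observed data, but local averages of many points.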
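A minimal sketch of the K-means idea, written with plain NumPy (Lloyd's algorithm with a fixed iteration count rather than a convergence test). The name `kmeans_subset` and the final step that snaps each mean to its nearest observed point are assumptions added here to work around the drawback mentioned above; a library implementation such as `scipy.cluster.vq.kmeans2` could replace the loop.

```python
import numpy as np

def kmeans_subset(points, k=100, n_iter=50, seed=0):
    """Run Lloyd's K-means, then return the observed point nearest each mean."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observed points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(n_iter):
        # Assign every point to its nearest center (squared distances suffice).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its cluster; empty clusters stay put.
        for c in range(k):
            members = points[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)

    # Snap each cluster mean to the closest observed point.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return points[d2.argmin(axis=0)], centers

rng = np.random.default_rng(1)
pts = rng.normal(0, 1, (5000, 2))
subset, means = kmeans_subset(pts, k=100)
```

Changing `seed` gives a different set of starting points, which is how the repeated runs described above could be generated.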

from numpy import array, argmax, ndarray, vstack
from numpy.random import randint
from scipy.spatial.distance import cdist


def well_spaced_points(points: ndarray, num_points: int):
    """
    Pick `num_points` well-spaced points from `points` array.

    :param points: An m x n array of m n-dimensional points.
    :param num_points: The number of points to pick.
    :rtype: ndarray
    :return: A num_points x n array of points from the original array.
    """
    # pick a random starting point from the whole array
    current_point_index = randint(0, points.shape[0])
    picked_points = array([points[current_point_index]])
    remaining_points = vstack((
        points[: current_point_index],
        points[current_point_index + 1:]
    ))
    # while there are more points to pick
    while picked_points.shape[0] < num_points:
        # for each remaining point, find the distance to its nearest picked
        # point, then take the remaining point for which that distance is largest
        distance_pk_rmn = cdist(picked_points, remaining_points)
        min_distance_pk = distance_pk_rmn.min(axis=0)
        i_furthest = argmax(min_distance_pk)
        # add it to picked points and remove it from remaining
        picked_points = vstack((
            picked_points,
            remaining_points[i_furthest]
        ))
        remaining_points = vstack((
            remaining_points[: i_furthest],
            remaining_points[i_furthest + 1:]
        ))

    return picked_points
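
The function above recomputes all picked-to-remaining distances on every pass. A possible variant, sketched here with NumPy only, keeps a running minimum distance per point instead, so each pass costs one distance computation against the newest pick. The name `farthest_point_subset` and the `seed` parameter are hypothetical, and the sketch assumes the input has at least `num_points` distinct points.

```python
import numpy as np

def farthest_point_subset(points, num_points, seed=0):
    """Greedy farthest-point selection using a running minimum distance."""
    rng = np.random.default_rng(seed)
    # Start from one randomly chosen point.
    picked = [int(rng.integers(len(points)))]
    # Distance from every point to its nearest picked point so far.
    min_dist = np.linalg.norm(points - points[picked[0]], axis=1)

    while len(picked) < num_points:
        # The next pick is the point farthest from everything picked so far.
        nxt = int(min_dist.argmax())
        picked.append(nxt)
        # Only distances to the newest pick can lower the running minimum.
        new_dist = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)

    return points[picked]

rng = np.random.default_rng(0)
pts = rng.normal(0, 1, (10000, 2))
subset = farthest_point_subset(pts, 100)
```

Already-picked points have a running distance of zero, so they are never selected twice as long as unpicked distinct points remain.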

