Question

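Suppose I have X and Y coordinates on a map and a non-parametric distribution of "hot zones" (for example, the degree of pollution on a geographic map at given X and Y coordinates). My input data are heatmaps.

I want to train a machine learning model that learns what a "hot zone" looks like, but I don't have many labeled examples. All "hot zones" look quite similar, but they may lie in different parts of my normalized XY coordinate map.

I can compute a multivariate KDE and plot the density map accordingly. To generate synthetic labeled data, can I "reverse" the KDE and randomly generate new image files whose observations fall within the "dense" range of my KDE?

Is there any way to do this in Python?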

Answer 1

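There are at least three high-quality kernel density estimation implementations available for Python: scipy, scikit-learn, and statsmodels.

My personal ranking is statsmodels > scikit-learn > scipy (best to worst), but this depends on your use case.

Some random remarks: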

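scikit-learn offers sampling from a fitted KDE out of the box (kde.sample(N))
scikit-learn offers good cross-validation functionality based on grid search or random search (using cross-validation is highly recommended)
statsmodels offers cross-validation methods based on optimization (can be slow for big datasets, but very accurate)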
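To make the "sampling from a fitted KDE" remark concrete: for a Gaussian kernel, drawing a sample conceptually amounts to picking one of the original data points at random and adding Gaussian noise whose standard deviation is the bandwidth. The following is only a minimal NumPy sketch of that idea, assuming a Gaussian kernel, a single scalar bandwidth and unweighted data; the function name sample_gaussian_kde is hypothetical and not part of any library.

```python
import numpy as np

def sample_gaussian_kde(data, bandwidth, n_samples, seed=None):
    """Draw samples from a Gaussian KDE built on `data` (shape: n_points x n_dims).

    Hypothetical helper for illustration; assumes a scalar bandwidth
    and equal weight for every data point.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Step 1: pick original data points uniformly at random, with replacement.
    idx = rng.integers(0, len(data), size=n_samples)
    # Step 2: perturb each picked point with isotropic Gaussian noise
    # whose standard deviation equals the bandwidth.
    noise = rng.normal(0.0, bandwidth, size=(n_samples, data.shape[1]))
    return data[idx] + noise

# Two well-separated "hot zones" in XY space.
points = np.array([[0.0, 0.0], [5.0, 5.0]])
samples = sample_gaussian_kde(points, bandwidth=0.1, n_samples=1000, seed=0)
```

With a small bandwidth the samples stay close to the original points; a larger bandwidth spreads them out, which is why choosing the bandwidth by cross-validation matters.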

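There are many more differences, and some of them are analyzed in a very good blog post by Jake VanderPlas. The following table is an excerpt from that post:

[Table: comparison of KDE implementations. From: https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/ (author: Jake VanderPlas)]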

Here is some example code using scikit-learn:

from sklearn.datasets import make_blobs
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import numpy as np

# Create test-data
data_x, data_y = make_blobs(n_samples=100, n_features=2, centers=7, cluster_std=0.5, random_state=0)

# Fit KDE (cross-validation used!)
params = {'bandwidth': np.logspace(-1, 2, 30)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data_x)
kde = grid.best_estimator_
bandwidth = grid.best_params_['bandwidth']

# Resample
N_POINTS_RESAMPLE = 1000
resampled = kde.sample(N_POINTS_RESAMPLE)

# Plot original data vs. resampled
fig, axs = plt.subplots(2, 2, sharex=True, sharey=True)

# One vectorized call instead of one scatter() call per point
axs[0,0].scatter(data_x[:, 0], data_x[:, 1])
axs[0,1].hexbin(data_x[:, 0], data_x[:, 1], gridsize=20)

# One vectorized call instead of one scatter() call per point
axs[1,0].scatter(resampled[:, 0], resampled[:, 1])
axs[1,1].hexbin(resampled[:, 0], resampled[:, 1], gridsize=20)

plt.show()

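Output: [figure: 2x2 grid of plots; the top row shows the original data as a scatter plot and a hexbin plot, the bottom row shows the resampled data in the same two forms]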
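Since the question asks for new heatmap images rather than raw points, one possible last step is to bin the resampled points into a 2-D grid. The following is only a sketch under assumptions: the helper points_to_heatmap is hypothetical, the coordinates are assumed to be normalized to [0, 1] on both axes, and the random points below merely stand in for the output of kde.sample(...).

```python
import numpy as np

def points_to_heatmap(points, bins=64, extent=((0.0, 1.0), (0.0, 1.0))):
    """Bin 2-D points into a bins x bins grid scaled to the range [0, 1].

    Hypothetical helper for illustration; points outside `extent` are ignored.
    """
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=bins, range=extent)
    peak = counts.max()
    # Transpose so that rows correspond to Y and columns to X.
    return (counts / peak).T if peak > 0 else counts.T

rng = np.random.default_rng(0)
# Stand-in for `kde.sample(1000)`: one "hot zone" in normalized XY coordinates.
resampled = rng.normal(loc=[0.3, 0.7], scale=0.05, size=(1000, 2))
heatmap = points_to_heatmap(resampled)
```

The resulting array could then be written to an image file, for instance with matplotlib's plt.imsave("sample.png", heatmap, origin="lower"). Repeating the sampling with different seeds would give a set of distinct synthetic heatmaps.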