Gaussian Mixture Model (GMM) gives unexpected results

Time: 2014-06-11 23:44:26

Tags: python scikit-learn

I have been playing with scikit-learn's GMM functionality. First, I just created a distribution along the line x=y.

from sklearn import mixture
import numpy as np 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)

#Create a distribution that's centred along y=x
#list() is needed on Python 3, where zip returns an iterator
line_model.fit(list(zip(xs, ys)))
plt.plot(xs, ys)
plt.show()

This produces the expected distribution: [image: The distribution]

Next, I evaluated the GMM over a grid and plotted the results:

#Create the x,y mesh that will be used to make a 3D plot
x_y_grid = []
for x in xs:
    for y in ys:
        x_y_grid.append([x,y])

#Calculate a log-likelihood for each point in the x,y grid.
x_y_z_grid = []
for x,y in x_y_grid:
    #score returns an array with one log-likelihood per sample
    z = line_model.score([[x, y]])[0]
    x_y_z_grid.append([x, y, z])

x_y_z_grid = np.array(x_y_z_grid)

#Plot the likelihood (exp of the log-likelihood) on the Z axis.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(x_y_z_grid[:,0], x_y_z_grid[:,1], np.exp(x_y_z_grid[:,2]))
plt.show()

The resulting probability distribution has some strange tails along x=0 and x=1, plus extra probability in the corners (x=1, y=1 and x=0, y=0). [image: Probability distribution n=99]

Using n_components=5 shows the same behaviour: [image: Probability distribution n=5]

Is this something inherent to GMMs, is there a problem with the implementation, or am I doing something wrong?

Edit: Taking the scores from the model seems to get rid of this behaviour - should it?

I am training both models on the same dataset (y=x from x=0 to x=1). Simply checking the probability via the gmm's score method seems to eliminate this boundary effect. Why is this? I have attached the plots and code below.

[image: Checking the scores over different domains affects the distribution.]

# Creates a line of 'observations' between (x_small_start, x_small_end)
# and (y_small_start, y_small_end). This is the data both gmms are trained on.
x_small_start = 0
x_small_end = 1
y_small_start = 0
y_small_end = 1

# These are the range of values that will be plotted
x_big_start = -1
x_big_end = 2
y_big_start = -1
y_big_end = 2


shorter_eval_range_gmm = mixture.GMM(n_components = 5)
longer_eval_range_gmm = mixture.GMM(n_components = 5)

x_small = np.linspace(x_small_start, x_small_end, 100)
y_small = np.linspace(y_small_start, y_small_end, 100)
x_big = np.linspace(x_big_start, x_big_end, 100)
y_big = np.linspace(y_big_start, y_big_end, 100)

#Train both gmms on a distribution that's centered along y=x
#list() is needed on Python 3, where zip returns an iterator
shorter_eval_range_gmm.fit(list(zip(x_small, y_small)))
longer_eval_range_gmm.fit(list(zip(x_small, y_small)))


#Create the x,y meshes that will be used to make a 3D plot
x_y_evals_grid_big = []
for x in x_big:
    for y in y_big:
        x_y_evals_grid_big.append([x,y])
x_y_evals_grid_small = []

for x in x_small:
    for y in y_small:
        x_y_evals_grid_small.append([x,y])

#Calculate a log-likelihood for each point in the x,y grid.
x_y_z_plot_grid_big = []
for x,y in x_y_evals_grid_big:
    z = longer_eval_range_gmm.score([[x, y]])[0]
    x_y_z_plot_grid_big.append([x, y, z])
x_y_z_plot_grid_big = np.array(x_y_z_plot_grid_big)

x_y_z_plot_grid_small = []
for x,y in x_y_evals_grid_small:
    z = shorter_eval_range_gmm.score([[x, y]])[0]
    x_y_z_plot_grid_small.append([x, y, z])
x_y_z_plot_grid_small = np.array(x_y_z_plot_grid_small)


#Plot probabilities on the Z axis.
fig = plt.figure()
fig.suptitle("Probability of different x,y pairs")

ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot(x_y_z_plot_grid_big[:,0], x_y_z_plot_grid_big[:,1], np.exp(x_y_z_plot_grid_big[:,2]))
ax1.set_xlabel('X Label')
ax1.set_ylabel('Y Label')
ax1.set_zlabel('Probability')
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.plot(x_y_z_plot_grid_small[:,0], x_y_z_plot_grid_small[:,1], np.exp(x_y_z_plot_grid_small[:,2]))
ax2.set_xlabel('X Label')
ax2.set_ylabel('Y Label')
ax2.set_zlabel('Probability')

plt.show()

2 Answers:

Answer 0 (score: 4)

There is nothing wrong with the fit, but with the visualization you are using. A hint is the straight line connecting (0, 1, 5) to (0, 1, 0), which is really just the rendering of the connection between two points (an artifact of the order in which the points are read). Although both points at its extrema are in your data, no other point on this line actually is.

Personally, for the reason mentioned above, I think it is a rather bad idea to use 3D line plots to represent a surface, and I would recommend surface plots or contour plots instead.

Try this:

from sklearn import mixture
import numpy as np 
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

line_model = mixture.GMM(n_components = 99)
#Create evenly distributed points between 0 and 1.
xs = np.atleast_2d(np.linspace(0, 1, 100)).T
ys = np.atleast_2d(np.linspace(0, 1, 100)).T

#Create a distribution that's centred along y=x
line_model.fit(np.concatenate([xs, ys], axis=1))
plt.scatter(xs, ys)
plt.show()

#Create the x,y mesh that will be used to make a 3D plot
X, Y = np.meshgrid(xs, ys)
x_y_grid = np.c_[X.ravel(), Y.ravel()]

#Calculate the log-likelihood of each point in the x,y grid.
#(Under the pre-0.18 GMM API, score returns one value per sample.)
z = line_model.score(x_y_grid)
z = z.reshape(X.shape)

#Plot probabilities on the Z axis.
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, z)
plt.show()
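As recommended above, a contour plot can be easier to read than a 3D line or surface plot. A minimal sketch, reusing X, Y and z from the block above:

#2D contour view of the same log-likelihood surface.
fig, ax = plt.subplots()
contours = ax.contourf(X, Y, z, 20)
fig.colorbar(contours, ax=ax, label='log-likelihood')
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()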

From an academic point of view, I am very uncomfortable with the goal of fitting a 1D line in 2D space with a 2D mixture model. Manifold learning with GMMs requires at least that the variance in the normal direction goes to zero, reducing to a Dirac distribution. Numerically and analytically this is unstable and should be avoided (there seems to be some stabilizing trick in the gmm fit, since the variance of the model is rather large in the direction of the normal of the line).
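One way to check this numerically is to fit with full covariance matrices and look at the eigenvalues of each component's covariance; the smaller eigenvalue is the variance normal to the line. A minimal sketch, assuming the pre-0.18 mixture.GMM API used throughout this post (newer scikit-learn versions would use GaussianMixture and its covariances_ attribute):

from sklearn import mixture
import numpy as np

xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)

#Fit with full covariance matrices so the principal axes are visible.
model = mixture.GMM(n_components=5, covariance_type='full')
model.fit(np.column_stack([xs, ys]))

#Eigenvalues of each 2x2 covariance: the larger one is the variance along
#the line, the smaller one the variance normal to it - note that the
#smaller one stays well above zero.
for covar in model.covars_:
    along, normal = np.linalg.eigvalsh(covar)[::-1]
    print("along the line: %.2e, normal to the line: %.2e" % (along, normal))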

It is also recommended to use plt.scatter rather than plt.plot when plotting data, since there is no reason to connect the points when you are fitting their joint distribution.

Hope this helps to shed some light on your problem.

Answer 1 (score: 1)

Edit: This is not correct. After talking with Ronald P., it turns out you cannot get Gibbs effects, because the Gaussians cannot compensate each other by "going negative": probability is strictly > 0. This seems to be a simple plotting issue... see his answer instead! Either way, I would recommend using 2D data to test GMMs, rather than a 1D line.

A GMM fits to the data you give it - specifically:

xs = np.linspace(0, 1, 100)
ys = np.linspace(0, 1, 100)

Because the data ends at 0 and 1, the GMM is trying to model that fact: -.01 and 1.01 are technically outside the training data range and should be scored with very low probability. Doing so ends up creating a Gaussian with a smaller spread (smaller covariance / higher precision) to cover the ends of the data and model the fact that the data stops there.
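A quick way to see this is to score points just outside the training range against one in the middle of it. A sketch, assuming line_model has already been fitted as in the question (under the pre-0.18 API, score returns one log-likelihood per sample):

import numpy as np

#Points just outside the training range should score far lower than the middle.
test_points = np.array([[-0.01, -0.01], [0.5, 0.5], [1.01, 1.01]])
for point, logprob in zip(test_points, line_model.score(test_points)):
    print(point, logprob)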

I would expect that adding enough Gaussians would lead to a pseudo-Gibbs phenomena effect, and you can see that happening in the change from 5 to 99. To model the edges exactly, you would need an infinite mixture model. This is analogous to infinite frequency components - you are representing a "signal" with a set of basis functions (in this case, Gaussians) in the GMM, too!