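Hello, I have a very specific, odd question about using MDS with Python.
When creating the distance matrix of the original high-dimensional dataset (call it distanceHD), you can measure the distances between all data points with either Euclidean or Manhattan distance.
Then, after running MDS, suppose I reduce 70+ columns down to 2. Now I can create a new distance matrix, call it distance2D, which again measures the distances between data points in either Manhattan or Euclidean.
Finally, I can take the difference between the two distance matrices (distanceHD and distance2D). This difference matrix shows me whether the distances between data points were preserved going from the large dataset to the new dataset with fewer columns (after running MDS). I can then compute the stress from that difference matrix using the stress function; the closer the value is to 0, the better the projection.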
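To make that last step concrete, here is a minimal sketch of the stress calculation as I understand it: the square root of the summed squared differences, divided by the summed squared high-dimensional distances. The function name `normalized_stress` and the two toy matrices are my own, made up only for illustration.

```python
import numpy as np

def normalized_stress(dist_hd, dist_2d):
    """Stress between two distance matrices of the same shape (hypothetical helper)."""
    dist_hd = np.asarray(dist_hd, dtype=float)
    dist_2d = np.asarray(dist_2d, dtype=float)
    diff = dist_hd - dist_2d
    # Squared differences, normalized by the squared original distances.
    return np.sqrt((diff ** 2).sum() / (dist_hd ** 2).sum())

# Toy 2-point example: original distance 2, projected distance either 2 or 1.
d_hd = np.array([[0.0, 2.0], [2.0, 0.0]])
d_same = d_hd.copy()
d_off = np.array([[0.0, 1.0], [1.0, 0.0]])

stress_same = normalized_stress(d_hd, d_same)  # distances fully preserved
stress_off = normalized_stress(d_hd, d_off)    # distances halved
print(stress_same, stress_off)
```

A value of 0 means every pairwise distance survived the projection unchanged; larger values mean more distortion.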
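My question: Originally, I was taught to use Manhattan distance for the distanceHD matrix and Euclidean distance for the distance2D matrix. But why? Why not use Manhattan for both? Or Euclidean for both? Or Euclidean on distanceHD and Manhattan on distance2D?
I guess the overall question is: when should either distance metric be used with the MDS algorithm?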
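One way I could probe this empirically is to compute the stress for all four metric combinations on the same toy data. This is only a rough sketch, not an answer: the helper `normalized_stress`, the fixed `random_state=0`, and the `results` dictionary are my own choices for illustration.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# Same four points (A, B, C, D) as in my example.
X = np.array([[0, 0, 0, 0],
              [1, 1, 1, 1],
              [0, 1, 2, 3],
              [0, 0, 0, 1]], dtype=float)

def normalized_stress(d_hd, d_2d):
    # Hypothetical helper: same stress formula as in my example.
    return np.sqrt(((d_hd - d_2d) ** 2).sum() / (d_hd ** 2).sum())

results = {}
for hd_metric in ("manhattan", "euclidean"):
    # Distance matrix of the original high-dimensional data.
    d_hd = pairwise_distances(X, metric=hd_metric)
    mds = MDS(n_components=2, dissimilarity="precomputed",
              n_init=10, max_iter=1000, random_state=0)
    embedding = mds.fit_transform(d_hd)
    for ld_metric in ("manhattan", "euclidean"):
        # Distance matrix of the 2D embedding, measured both ways.
        d_2d = pairwise_distances(embedding, metric=ld_metric)
        results[(hd_metric, ld_metric)] = normalized_stress(d_hd, d_2d)

for (hd_metric, ld_metric), stress in results.items():
    print(f"HD={hd_metric:<9} 2D={ld_metric:<9} stress={stress:.4f}")
```

One thing I do notice: the orientation of an MDS layout is arbitrary, and Manhattan distance changes when the points are rotated while Euclidean distance does not, so the Manhattan-in-2D numbers depend on how the layout happens to be rotated. As far as I understand, sklearn's MDS also fits the layout against Euclidean distances in the embedding, which may be part of the reason, but I'd like that confirmed.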
Sorry for the long and possibly confusing post. I've shown an example below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataHD = pd.DataFrame(
[[0,0,0,0],
[1,1,1,1],
[0,1,2,3],
[0,0,0,1]],
index=['A','B','C','D'],
columns=['1','2','3','4'])
dataHD
import sklearn.metrics.pairwise as smp
distHD = smp.manhattan_distances(dataHD) #L1 Distance Function
distHD = pd.DataFrame(distHD, columns=dataHD.index, index=dataHD.index)
distHD
import sklearn.manifold
# Here we're going to search for a local minimum of the stress
# dissimilarity='precomputed' says the input is already a distance matrix
# shift + tab will show parameters
# n_init: Number of times the SMACOF algorithm will be run with different initializations.
# The final result will be the best output of those runs, i.e. the one with the smallest final stress.
# max_iter: Maximum number of iterations of the SMACOF algorithm for a single run.
mds = sklearn.manifold.MDS(dissimilarity = 'precomputed', n_init=10, max_iter=1000)
# NOTE: you will get different numbers every time you run this, because each run
# can end up in a different local minimum (pass random_state to make it repeatable)
# The key takeaway here is that the distances between data points are approximately preserved
data2D = mds.fit_transform(distHD)
# Recall: we're using new columns that summarize the distHD table, so pick new column names
data2D = pd.DataFrame(data2D, columns=['x', 'y'], index = dataHD.index)
data2D
## Plot the MDS 2D result
%matplotlib inline
ax = data2D.plot.scatter(x='x', y='y')
# Label each data point with its index (A, B, C, D).
# Iterating over rows avoids integer lookups like data2D.x[0] on a string index,
# which newer pandas versions no longer treat as positional.
for label, row in data2D.iterrows():
    ax.text(row['x'], row['y'], label)
dist2D = smp.euclidean_distances(data2D) #L2 Distance Function
dist2D = pd.DataFrame(dist2D, columns = data2D.index, index = data2D.index)
dist2D
## Stress function: sqrt(sum of squared differences / sum of squared distHD entries)
np.sqrt(((distHD - dist2D) **2).sum().sum() / (distHD**2).sum().sum())
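About the note in my example that the numbers change on every run: a small sketch of how I believe the result can be made repeatable, by passing a fixed `random_state` to `MDS`. The helper name `embed` and the seed value 0 are arbitrary choices of mine.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# Same four points (A, B, C, D) as in my example.
X = np.array([[0, 0, 0, 0],
              [1, 1, 1, 1],
              [0, 1, 2, 3],
              [0, 0, 0, 1]], dtype=float)
dist_hd = pairwise_distances(X, metric="manhattan")

def embed(seed):
    # Hypothetical helper: one MDS run on the precomputed distances with a fixed seed.
    mds = MDS(n_components=2, dissimilarity="precomputed",
              n_init=10, max_iter=1000, random_state=seed)
    return mds.fit_transform(dist_hd)

run_a = embed(0)
run_b = embed(0)

# With the same seed, both runs should produce the same 2D coordinates.
same_layout = np.allclose(run_a, run_b)
print(same_layout)
```

Fixing the seed only makes the stress values comparable between runs; it does not mean the layout found is the global minimum.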