Centroid distances computed in PCA space and in feature space differ

Time: 2016-12-06 18:12:35

Tags: r machine-learning distance data-mining pca

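I am measuring the distances between centroids in PCA space and in feature space, across roughly 20 treatments and 3 groups. If I understood my math teachers correctly, the distance between two centroids should be the same in both spaces. The way I calculate them, however, they are not, and I am wondering whether my math is wrong in one of the two calculations.

I use the well-known wine dataset to illustrate my method as an MWE: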

library(ggbiplot)
data(wine)
treatments <- 1:2 #treatments to be considered for this calculation
wine.pca <- prcomp(wine[treatments], scale. = TRUE)
#calculate the centroids for the feature/treatment space and the pca space
df.wine.x <- as.data.frame(wine.pca$x)
df.wine.x$groups <- wine.class
wine$groups <- wine.class
feature.centroids <- aggregate(wine[treatments], list(Type = wine$groups), mean)
pca.centroids <- aggregate(df.wine.x[treatments], list(Type = df.wine.x$groups), mean)
pca.centroids
feature.centroids
#calculate distance between the centroids of barolo and grignolino
dist(rbind(feature.centroids[feature.centroids$Type == "barolo",][-1],feature.centroids[feature.centroids$Type == "grignolino",][-1]), method = "euclidean")
dist(rbind(pca.centroids[pca.centroids$Type == "barolo",][-1],pca.centroids[pca.centroids$Type == "grignolino",][-1]), method = "euclidean")

The last two lines return 1.468087 for the distance in feature space and 1.80717 for the distance in PCA space, so something is off...

1 answer:

Answer 0: (score: 1)

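This is caused by the scaling and centering. If you run the PCA without scaling and centering, and keep all principal components, the distance is exactly the same in the original feature space and in PCA space. Strictly speaking, only the scaling changes distances; centering shifts every point by the same vector, so it cancels out when two centroids are subtracted.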

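wine.pca <- prcomp(wine[treatments], scale. = FALSE, center = FALSE)
# recompute df.wine.x and pca.centroids from the new wine.pca$x (as in the question) before calling dist()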

dist(rbind(feature.centroids[feature.centroids$Type == "barolo",][-1],feature.centroids[feature.centroids$Type == "grignolino",][-1]), method = "euclidean")
#         1
# 2 1.468087
dist(rbind(pca.centroids[pca.centroids$Type == "barolo",][-1],pca.centroids[pca.centroids$Type == "grignolino",][-1]), method = "euclidean")
#         1
# 2 1.468087
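
The agreement above can be written out in a few lines. This is only a sketch under assumed notation that does not appear in the original post: $\bar x_a, \bar x_b$ are two group centroids in feature space, $\bar z_a, \bar z_b$ the corresponding centroids of the PCA scores, $\mu$ the vector of column means, $R$ the rotation matrix (`wine.pca$rotation`, square and orthogonal when all components are kept), and $D$ the diagonal matrix of column standard deviations. The rotation $R$ is a different matrix in the scaled and unscaled case, but it is orthogonal in both.

```latex
% Unscaled PCA: the scores are a shifted rotation of the data
z = R^{\top}(x - \mu), \qquad R^{\top}R = RR^{\top} = I

% Centroids transform the same way, and the shift cancels in the difference
\lVert \bar z_a - \bar z_b \rVert^2
  = (\bar x_a - \bar x_b)^{\top} R R^{\top} (\bar x_a - \bar x_b)
  = \lVert \bar x_a - \bar x_b \rVert^2

% Scaled PCA: each feature is divided by its standard deviation first
z = R^{\top} D^{-1}(x - \mu)
\;\Longrightarrow\;
\lVert \bar z_a - \bar z_b \rVert = \lVert D^{-1}(\bar x_a - \bar x_b) \rVert
```

So with scaling, the PCA distance equals the centroid distance in the scaled features, which generally differs from the distance in the raw features.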

Another way to get identical results is to scale/center the original data first and then apply PCA with scaling/centering, as follows:

wine[treatments] <- scale(wine[treatments], center = TRUE)
wine.pca <- prcomp(wine[treatments], scale. = TRUE)
# recompute feature.centroids and pca.centroids (as in the question) before calling dist()

dist(rbind(feature.centroids[feature.centroids$Type == "barolo",][-1],feature.centroids[feature.centroids$Type == "grignolino",][-1]), method = "euclidean")
#        1
# 2 1.80717
dist(rbind(pca.centroids[pca.centroids$Type == "barolo",][-1],pca.centroids[pca.centroids$Type == "grignolino",][-1]), method = "euclidean")
#        1
# 2 1.80717
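
Both cases can be checked side by side in a self-contained script. The sketch below is not the code from the question: it uses base R only, swaps in the built-in `iris` data for the wine data so that no extra package is needed, and introduces a hypothetical helper `centroid_dist()` for illustration.

```r
# Base-R sketch; iris stands in for the wine data (an assumption made so that
# the example runs without ggbiplot).
features <- iris[, 1:4]
groups   <- iris$Species

# Hypothetical helper: Euclidean distance between the centroids of groups a and b.
centroid_dist <- function(x, groups, a, b) {
  ca <- colMeans(x[groups == a, , drop = FALSE])
  cb <- colMeans(x[groups == b, , drop = FALSE])
  sqrt(sum((ca - cb)^2))
}

# Case 1: no scaling. With all components kept, PCA is a rotation plus a shift,
# so the centroid distance is unchanged.
pca_raw       <- prcomp(features, center = TRUE, scale. = FALSE)
d_feature_raw <- centroid_dist(features,  groups, "setosa", "versicolor")
d_pca_raw     <- centroid_dist(pca_raw$x, groups, "setosa", "versicolor")

# Case 2: scaling. The PCA distance matches the distance in the *scaled* features.
pca_scaled       <- prcomp(features, center = TRUE, scale. = TRUE)
d_feature_scaled <- centroid_dist(scale(features), groups, "setosa", "versicolor")
d_pca_scaled     <- centroid_dist(pca_scaled$x,    groups, "setosa", "versicolor")

print(c(feature = d_feature_raw,    pca = d_pca_raw))
print(c(feature = d_feature_scaled, pca = d_pca_scaled))
```

Within each case the two printed values should agree up to floating-point error, while the values differ between the two cases. That is the same pattern as the 1.468087 versus 1.80717 discrepancy in the question.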