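What exactly are ctr, dist and dim in the FactoMineR PCA summary?

Asked: 2017-08-27 01:58:55

Tags: r multidimensional-array vector pca

I am trying to run PCA and MCA on my dataset using the FactoMineR package.

After some preliminary cleaning, I applied the PCA() function to the dataset, and I am now trying to understand the summary of the result.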

library(reshape)
library(gridExtra)
library(gdata)
library(ggplot2)
library(ggbiplot)
library(FactoMineR)

x <- read.csv('cars.csv',stringsAsFactors = FALSE)
y <- na.omit(x)

y <- y[,c(-8,-9)]
s <- y[,-1]
rownames(s) <- make.names(y[,1], unique = TRUE)


res.pca <- PCA(s, quanti.sup = NULL, quali.sup=NULL,scale.unit = TRUE,ncp=2)
summary(res.pca)

This is what summary(res.pca) prints in my console:

Call:
PCA(X = s, scale.unit = TRUE, ncp = 2, quanti.sup = NULL, quali.sup = NULL) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance               4.788   0.729   0.258   0.125   0.063   0.036
% of var.             79.804  12.144   4.308   2.086   1.053   0.605
Cumulative % of var.  79.804  91.948  96.256  98.342  99.395 100.000

Individuals (the 10 first)
                              Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
chevrolet.chevelle.malibu |  2.516 |  2.326  0.288  0.855 | -0.572  0.115  0.052 |
buick.skylark.320         |  3.307 |  3.206  0.548  0.940 | -0.683  0.163  0.043 |
plymouth.satellite        |  2.915 |  2.670  0.380  0.839 | -0.994  0.346  0.116 |
amc.rebel.sst             |  2.749 |  2.605  0.362  0.898 | -0.623  0.136  0.051 |
ford.torino               |  2.908 |  2.600  0.360  0.799 | -1.094  0.419  0.141 |
ford.galaxie.500          |  4.578 |  4.401  1.032  0.924 | -1.011  0.358  0.049 |
chevrolet.impala          |  5.210 |  4.920  1.289  0.892 | -1.368  0.655  0.069 |
plymouth.fury.iii         |  5.144 |  4.836  1.246  0.884 | -1.537  0.827  0.089 |
pontiac.catalina          |  5.165 |  4.910  1.285  0.904 | -1.041  0.379  0.041 |
amc.ambassador.dpl        |  4.406 |  4.056  0.876  0.847 | -1.668  0.974  0.143 |

Variables
                             Dim.1    ctr   cos2    Dim.2    ctr   cos2  
Cylinders                 |  0.942 18.543  0.888 |  0.127  2.200  0.016 |
Displacement              |  0.971 19.672  0.942 |  0.093  1.177  0.009 |
Horsepower                |  0.950 18.846  0.902 | -0.142  2.761  0.020 |
Weight                    |  0.941 18.499  0.886 |  0.244  8.185  0.060 |
MPG                       | -0.873 15.918  0.762 | -0.209  5.994  0.044 |
Acceleration              | -0.639  8.522  0.408 |  0.762 79.683  0.581 |

While I understand most of this summary, I am not sure what Dist, ctr and Dim mean for the individual data points, i.e.

Individuals (the 10 first)
                              Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
chevrolet.chevelle.malibu |  2.516 |  2.326  0.288  0.855 | -0.572  0.115  0.052 |
buick.skylark.320         |  3.307 |  3.206  0.548  0.940 | -0.683  0.163  0.043 |
plymouth.satellite        |  2.915 |  2.670  0.380  0.839 | -0.994  0.346  0.116 |
amc.rebel.sst             |  2.749 |  2.605  0.362  0.898 | -0.623  0.136  0.051 |

1 Answer:

Answer 0 (score: 2)

Let's look at the individuals' summary table for a sample dataset that ships with the package, for illustration:

library(FactoMineR)
data(decathlon)
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup=13)

> summary(res.pca)
Call:
PCA(X = decathlon, ncp = 5, quanti.sup = 11:12, quali.sup = 13) 
...
Individuals (the 10 first)
                Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
SEBRLE      |  2.369 |  0.792  0.467  0.112 |  0.772  0.836  0.106 |  0.827  1.187
CLAY        |  3.507 |  1.235  1.137  0.124 |  0.575  0.464  0.027 |  2.141  7.960
KARPOV      |  3.396 |  1.358  1.375  0.160 |  0.484  0.329  0.020 |  1.956  6.644
...

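Dist is the distance of each individual from the origin, summarising that individual's measurements across all relevant columns in the dataset. It is calculated as sqrt(rowSums(X^2)), where X is the scaled version of the input dataset s (after dropping the supplementary variables).

If the default options in PCA are in place, i.e. scale.unit = TRUE, row.w = NULL and col.w = NULL, then X should be equivalent to scale(as.matrix(<trimmed down dataset>)) * sqrt(n / (n - 1)), where n is the number of rows. I have not checked this for non-default options, because I find the intuitive explanation more important here than the detailed calculation.

# verify the calculated values against summary table's Dist values
> X <- scale(as.matrix(decathlon[,1:10])) * sqrt(nrow(decathlon)/(nrow(decathlon) - 1))
> sqrt(rowSums(X^2))
     SEBRLE        CLAY      KARPOV     BERNARD      YURKOV     WARNERS   ZSIVOCZKY
   2.368839    3.507004    3.396399    2.762607    3.017906    2.427873    2.563128
...

Dim.X is the coordinate of each individual on principal component X, i.e. the projection of the individual's position in the multi-dimensional space onto that component. To visualise this, use plot(res.pca, choix = "ind") to draw the individual factor map, adjust the axes / xlim / ylim arguments to zoom in on any specific individual, and compare against the table values. Check ?plot.PCA for more arguments of the function.

# plot individual factor map in the first two principal components
> plot(res.pca, axes = c(1, 2), choix = "ind")

# zoom in to check Sebrle, Clay & Karpov's coordinates
> plot(res.pca, axes = c(1, 2), choix = "ind", xlim = c(0, 2), ylim = c(0, 1))

[Figure: individual factor map, zoomed in]

ctr is each individual's contribution to a given principal component, expressed as a percentage. You can get the full list of contributions from res.pca$ind$contrib. Each column sums to 100 (%).

# view each individual's contribution to each principal component
> head(res.pca$ind$contrib)
             Dim.1     Dim.2    Dim.3      Dim.4      Dim.5
SEBRLE  0.46715109 0.8359506 1.186888  3.1842186  1.7811617
CLAY    1.13695340 0.4635341 7.959744  0.2905893 13.8872052
KARPOV  1.37515734 0.3289363 6.643820  7.9543342  2.2523610
BERNARD 0.27693912 1.0740657 1.374952 11.3801552  0.4658144
YURKOV  0.25595504 6.3757577 2.605847  1.7611939  5.5775065
WARNERS 0.09494738 3.9862179 1.020117  0.8014610  3.5736432

# verify each principal component's contributions sum up to 100%.
> colSums(res.pca$ind$contrib)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
  100   100   100   100   100

cos2 is the squared cosine of the individual with respect to each principal component, calculated as (Dim.X / Dist)^2. The closer it is to 1 for a given principal component, the better that component captures all the characteristics of that individual.

For variables, the interpretation of Dim.X / ctr / cos2 is similar. The exact calculations are more complicated, especially if non-uniform weights are specified for rows / columns. You can look at the code of PCA for the details.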
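To make the relationship between the columns concrete, here is a minimal base-R sketch that rebuilds Dist, Dim.X, ctr and cos2 without FactoMineR. It assumes the default settings discussed above (scale.unit = TRUE, uniform row and column weights) and is not a copy of the package's internal code. The USArrests dataset is only a stand-in, and all object names (dat, coord, ctr, var_coord, ...) are made up for this illustration.

```r
# Stand-in data: any all-numeric data frame without missing values will do.
dat <- USArrests
n   <- nrow(dat)

# Standardise with the population SD (divide by n rather than n - 1),
# which is the scaling described above for scale.unit = TRUE.
X <- scale(as.matrix(dat)) * sqrt(n / (n - 1))

# Principal axes and eigenvalues from the SVD of the standardised data.
sv  <- svd(X)
eig <- sv$d^2 / n                        # "Variance" row of the Eigenvalues table

# Individuals
coord <- X %*% sv$v                      # Dim.X: coordinates on each component
colnames(coord) <- paste0("Dim.", seq_along(eig))
dist  <- sqrt(rowSums(X^2))              # Dist: distance from the origin
ctr   <- 100 * sweep(coord^2, 2, n * eig, "/")  # ctr: contribution in percent
cos2  <- coord^2 / dist^2                # cos2 = (Dim.X / Dist)^2

# Variables (assumption: standardised PCA with uniform weights)
var_coord <- sweep(sv$v, 2, sqrt(eig), "*")      # correlation with each component
rownames(var_coord) <- colnames(dat)
var_cos2  <- var_coord^2
var_ctr   <- 100 * sweep(var_cos2, 2, eig, "/")  # contribution in percent

head(cbind(Dist = dist, coord[, 1:2]), 3)
colSums(ctr)         # contributions on each component add up to 100
rowSums(cos2)[1:3]   # with all components kept, cos2 adds up to 1 per individual

# Spot check against the table in the question:
(2.326 / 2.516)^2    # chevrolet.chevelle.malibu on Dim.1, table shows cos2 = 0.855
100 * 0.888 / 4.788  # Cylinders on Dim.1, table shows ctr = 18.543
```

The last two lines use only numbers printed in the question's summary, so small differences in the third decimal come from rounding in the table. With non-uniform row.w or col.w the weights would enter each of these formulas, which this sketch does not cover.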