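I am trying to run PCA and MCA on my dataset with the FactoMineR package.
After some preliminary cleaning of the dataset, I applied the PCA() function to it, and I am now trying to understand the summary of the results.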
library(reshape)
library(gridExtra)
library(gdata)
library(ggplot2)
library(ggbiplot)
library(FactoMineR)
x <- read.csv('cars.csv',stringsAsFactors = FALSE)
y <- na.omit(x)
y <- y[,c(-8,-9)]
s <- y[,-1]
rownames(s) <- make.names(y[,1], unique = TRUE)
res.pca <- PCA(s, quanti.sup = NULL, quali.sup=NULL,scale.unit = TRUE,ncp=2)
summary(res.pca)
This is what summary(res.pca) prints in my console:
Call:
PCA(X = s, scale.unit = TRUE, ncp = 2, quanti.sup = NULL, quali.sup = NULL)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
Variance 4.788 0.729 0.258 0.125 0.063 0.036
% of var. 79.804 12.144 4.308 2.086 1.053 0.605
Cumulative % of var. 79.804 91.948 96.256 98.342 99.395 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2
chevrolet.chevelle.malibu | 2.516 | 2.326 0.288 0.855 | -0.572 0.115 0.052 |
buick.skylark.320 | 3.307 | 3.206 0.548 0.940 | -0.683 0.163 0.043 |
plymouth.satellite | 2.915 | 2.670 0.380 0.839 | -0.994 0.346 0.116 |
amc.rebel.sst | 2.749 | 2.605 0.362 0.898 | -0.623 0.136 0.051 |
ford.torino | 2.908 | 2.600 0.360 0.799 | -1.094 0.419 0.141 |
ford.galaxie.500 | 4.578 | 4.401 1.032 0.924 | -1.011 0.358 0.049 |
chevrolet.impala | 5.210 | 4.920 1.289 0.892 | -1.368 0.655 0.069 |
plymouth.fury.iii | 5.144 | 4.836 1.246 0.884 | -1.537 0.827 0.089 |
pontiac.catalina | 5.165 | 4.910 1.285 0.904 | -1.041 0.379 0.041 |
amc.ambassador.dpl | 4.406 | 4.056 0.876 0.847 | -1.668 0.974 0.143 |
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2
Cylinders | 0.942 18.543 0.888 | 0.127 2.200 0.016 |
Displacement | 0.971 19.672 0.942 | 0.093 1.177 0.009 |
Horsepower | 0.950 18.846 0.902 | -0.142 2.761 0.020 |
Weight | 0.941 18.499 0.886 | 0.244 8.185 0.060 |
MPG | -0.873 15.918 0.762 | -0.209 5.994 0.044 |
Acceleration | -0.639 8.522 0.408 | 0.762 79.683 0.581 |
I understand the rest of this summary, but I am not sure what Dist, ctr and Dim mean for the individual data points, i.e.
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2
chevrolet.chevelle.malibu | 2.516 | 2.326 0.288 0.855 | -0.572 0.115 0.052 |
buick.skylark.320 | 3.307 | 3.206 0.548 0.940 | -0.683 0.163 0.043 |
plymouth.satellite | 2.915 | 2.670 0.380 0.839 | -0.994 0.346 0.116 |
amc.rebel.sst | 2.749 | 2.605 0.362 0.898 | -0.623 0.136 0.051 |
Answer 0 (score: 2)
Let's look at the individuals summary table for a sample dataset that ships with the package, for illustration:
library(FactoMineR)
data(decathlon)
res.pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup=13)
> summary(res.pca)
Call:
PCA(X = decathlon, ncp = 5, quanti.sup = 11:12, quali.sup = 13)
...
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
SEBRLE | 2.369 | 0.792 0.467 0.112 | 0.772 0.836 0.106 | 0.827 1.187
CLAY | 3.507 | 1.235 1.137 0.124 | 0.575 0.464 0.027 | 2.141 7.960
KARPOV | 3.396 | 1.358 1.375 0.160 | 0.484 0.329 0.020 | 1.956 6.644
...
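Dist can be thought of as a summary measure of an individual's measurements across all relevant columns of the dataset. It is calculated as sqrt(rowSums(X^2)), where X is a scaled version of the input dataset s (after removal of the supplementary variables).

If the default options of PCA are in place, i.e. scale.unit = TRUE, row.w = NULL and col.w = NULL, then X should be equivalent to
scale(as.matrix(<trimmed down dataset>)) * sqrt(n / (n - 1))
where n is the number of rows. I have not checked this for non-default options, as I think an intuitive explanation matters more here than the detailed calculation.

# verify the calculated values against summary table's Dist values
> X <- scale(as.matrix(decathlon[,1:10])) * sqrt(nrow(decathlon)/(nrow(decathlon) - 1))
> sqrt(rowSums(X^2))
SEBRLE CLAY KARPOV BERNARD YURKOV WARNERS ZSIVOCZKY
2.368839 3.507004 3.396399 2.762607 3.017906 2.427873 2.563128
...

Dim.X is the projection of each individual's position (its distance from the origin in the multi-dimensional space) onto principal component X. To visualize this, use plot(res.pca, choix = "ind") to draw the individual factor map, and adjust the xlim / ylim / axes parameters to zoom in on any specific individual and compare against the table values. See ?plot.PCA for more of the function's parameters.

# plot individual factor map in the first two principal components
> plot(res.pca, axes = c(1, 2), choix = "ind")
# zoom in to check Sebrle, Clay, & Karpov's coordinates
> plot(res.pca, axes = c(1, 2), choix = "ind", xlim = c(0, 2), ylim = c(0, 1))

ctr is each individual's contribution to a given principal component, expressed as a percentage. You can get the full list of contributions from res.pca$ind$contrib. Each column sums to 100 (%).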
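If it helps, the Dist, Dim.X and ctr columns can also be reproduced without FactoMineR. The following is only a rough base-R sketch, not the package's actual code. It assumes the default options (unit scaling, uniform row and column weights), uses the built-in USArrests data as a stand-in dataset, and the names dist_ind, coord, eig and contrib are made up for this illustration.

```r
# Base-R sketch under the assumptions stated above; USArrests ships with R.
dat <- USArrests
n <- nrow(dat)

# Standardise each column with the population SD (divisor n instead of n - 1)
X <- scale(as.matrix(dat)) * sqrt(n / (n - 1))

# Dist: distance of each individual from the origin of the scaled space
dist_ind <- sqrt(rowSums(X^2))

# Dim.k: coordinates of the individuals on the principal axes, via SVD
sv <- svd(X)
coord <- X %*% sv$v
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))

# Eigenvalues (variance carried by each axis, with uniform weights 1/n)
eig <- sv$d^2 / n

# ctr: share of each axis' variance that comes from each individual, in %
contrib <- 100 * sweep(coord^2, 2, n * eig, "/")
colnames(contrib) <- paste0("ctr.", seq_len(ncol(contrib)))

# First rows, laid out like the "Individuals" table of summary()
print(round(head(cbind(Dist = dist_ind, coord[, 1:2], contrib[, 1:2]), 3), 3))

# Each contribution column adds up to 100
print(colSums(contrib))
```

If the assumptions hold, these numbers should agree with what summary() reports for a PCA() fit on the same data, up to the sign of each axis, since the direction of a principal axis is arbitrary.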
# view each individual's contribution to each principal component
> head(res.pca$ind$contrib)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
SEBRLE 0.46715109 0.8359506 1.186888 3.1842186 1.7811617
CLAY 1.13695340 0.4635341 7.959744 0.2905893 13.8872052
KARPOV 1.37515734 0.3289363 6.643820 7.9543342 2.2523610
BERNARD 0.27693912 1.0740657 1.374952 11.3801552 0.4658144
YURKOV 0.25595504 6.3757577 2.605847 1.7611939 5.5775065
WARNERS 0.09494738 3.9862179 1.020117 0.8014610 3.5736432
# verify each principal component's contributions sum up to 100%.
> colSums(res.pca$ind$contrib)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
100 100 100 100 100

cos2 is the squared cosine for each principal component, calculated as (Dim.X / Dist)^2. The closer it is to 1 for a given principal component, the better that component captures that individual's characteristics.

For the variables, "Dim.X" / "ctr" / "cos2" are interpreted similarly. The exact calculations are more complicated, especially if non-uniform weights are specified for the rows/columns. You can check the code of PCA for details.
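To make the cos2 formula concrete, it can be checked by hand against the Dist and Dim.1 values of the first three decathlon rows printed earlier. The second half of the block is a hedged base-R sketch of the variable-level columns. It assumes uniform weights, uses the built-in USArrests data as a stand-in, and the names var_coord, var_cos2 and var_contrib are invented for this illustration rather than taken from FactoMineR.

```r
# cos2 on Dim.1, recomputed from the Dist and Dim.1 columns of the table
dist_tab <- c(SEBRLE = 2.369, CLAY = 3.507, KARPOV = 3.396)
dim1_tab <- c(SEBRLE = 0.792, CLAY = 1.235, KARPOV = 1.358)
cos2_dim1 <- round((dim1_tab / dist_tab)^2, 3)
print(cos2_dim1)  # the table's cos2 column shows 0.112, 0.124, 0.160

# Variable-level sketch (assumption: unit scaling and uniform weights)
dat <- USArrests
n <- nrow(dat)
X <- scale(as.matrix(dat)) * sqrt(n / (n - 1))
sv <- svd(X)
eig <- sv$d^2 / n

# Dim.k for a variable: its correlation with principal component k
var_coord <- sweep(sv$v, 2, sqrt(eig), "*")
dimnames(var_coord) <- list(colnames(dat), paste0("Dim.", seq_along(eig)))

# cos2: squared coordinate; ctr: share of the axis' eigenvalue, in %
var_cos2 <- var_coord^2
var_contrib <- 100 * sweep(var_cos2, 2, eig, "/")

print(round(var_coord, 3))
print(colSums(var_contrib))  # each column adds up to 100
```

Under these assumptions a variable's cos2 values sum to 1 across all components, so a variable with a high cos2 on Dim.1 (such as Displacement in the question's output) is almost entirely described by that single axis.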