R中的PCA多色谱

时间:2014-06-18 09:38:52

标签: r plot 3d cluster-analysis pca

我有一个如下所示的数据集:

India   China   Brasil  Russia  SAfrica Kenya   States  Indonesia   States  Argentina   Chile   Netherlands HongKong
0.0854026763    0.1389383234    0.1244184371    0.0525460881    0.2945586244    0.0404562539    0.0491597968    0   0   0.0618342901    0.0174891774    0.0634064181    0
0.0519483159    0.0573851759    0.0756806292    0.0207164181    0.0409872092    0.0706355932    0.0664503936    0.0775285039    0.008545575 0.0365674701    0.026595575 0.064280902 0.0338135148
0   0   0   0   0   0   0   0   0   0   0   0   0
0.0943708876    0   0   0.0967733329    0   0.0745076688    0   0   0   0.0427047276    0   0.0583873189    0
0.0149521013    0.0067569437    0.0108914448    0.0229991162    0.0151678343    0.0413174214    0   0.0240999375    0   0.0608951432    0.0076549109    0   0.0291972756
0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0   0   0   0   0   0   0   0   0   0
0   0   0   0.0096710124    0.0095669967    0   0.0678582869    0   0   0.0170707337    0.0096565543    0.0116698364    0.0122773071
0.1002690681    0.0934563916    0.0821680095    0.1349534369    0.1017157777    0.1113249348    0.1713480649    0.0538715423    0.4731833978    0.1956743964    0.6865919069    0.2869189344    0.5364034876
1.5458338337    0.2675380321    0.6229046372    0.5059107039    0.934209603 0.4933799388    0.4259769181    0.3534169521    14.4134845836   4.8817632117    13.4034293299   3.7849346739    12.138551171
0.4625375671    0.320258205 0.4216459567    0.4992764309    0.4115887595    0.4783677078    0.4982410179    0.2790259278    0.3804405781    0.2594924212    0.4542162376    0.3012339384    0.3450847892
0.357614592 0.3932670219    0.3803417257    0.4615355254    0.3807061655    0.4122433346    0.4422282977    0.3053712842    0.297943232 0.2658160167    0.3244018409    0.2523836582    0.3106600754
0.359953567 0.3958391813    0.3828293473    0.4631507073    0.3831961707    0.4138590365    0.4451206879    0.3073685624    0.2046559772    0.2403036541    0.2326305393    0.2269373716    0.2342962436
0.7887404662    0.6545878236    0.7443676393    0.7681244767    0.5938002158    0.5052305973    0.4354571648    0.40511005  0.8372481106    0.5971130339    0.8025313223    0.5708610817    0.8556609579
0.5574207497    1.2175251783    0.8797484259    0.952685465 0.4476585005    1.1919229479    1.03612509  0.5490564488    0.2407034171    0.5675492645    0.4994121344    0.5460544861    0.3779468604
0.5632651223    1.0181714714    1.1253803155    1.228293512 0.6949993291    1.0346288085    0.5955221073    0.5212567091    1.1674901423    1.2442735568    1.207624867 1.3854352274    0.7557131826
0.6914760031    0.7831502333    1.0282730148    0.750270567 0.7072739935    0.8041764647    0.8918512571    0.6998554585    2.3448306081    1.2905783367    2.4295927684    1.3029766224    1.9310763864
0.3459898177    0.7474525109    0.7253451876    0.7182493014    0.3081791886    0.7462088907    0.5950509439    0.4443221541    3.6106852374    2.7647504885    3.3698608994    2.6523062395    1.8016571476
0.4629523517    0.6549211677    0.6158018856    0.7637088814    0.4951554309    0.6277236471    0.6227669055    0.383909839 2.9502307101    1.803480973 2.3083113522    1.668759497 1.7130459012
0.301548861 0.5961888126    0.4027007075    0.5540290853    0.4078662541    0.5108773106    0.4610682726    0.3712800134    0.3813402422    0.7391417247    1.0935364978    0.691857974 0.4416304953
2.5038287529    3.2005148394    2.9181517373    3.557918333 1.8868234768    2.9369926312    0.4117894127    0.3074815035    3.9187777037    7.3161555954    6.9586996112    5.7096144353    2.7007439732
2.5079707359    3.2058093222    2.9229791182    3.563804054 1.8899447728    2.9418511798    0.4124706194    0.269491388 3.9252603798    7.3282584169    6.9702111077    5.7190596205    2.7052117051
2.6643724791    1.2405320493    2.0584120188    2.2354369334    1.7199730388    2.039829709 1.7428132997    0.9977029725    8.9650886611    4.6035139163    8.1430131464    5.2450639988    6.963309864
0.5270581435    0.8222128903    0.7713479951    0.8785815313    0.624993821 0.7410405193    0.5350834321    0.4797121891    1.3753525725    1.2219267886    1.397221881 1.2433155977    0.8647136903
0.2536079475    0.5195514789    0.0492623195    0.416102668 0.2572670724    0.4805482899    0.4866090738    0.4905212099    0.2002506403    0.5508609827    0.3808572148    0.6276294938    0.3191452919
0.3499009885    0.5837491529    0.4914807442    0.5851537888    0.3638549977    0.537655052 0.5757185943    0.4730102035    0.9098072064    0.6197285737    0.7781825654    0.6424684366    0.6424429128
0.6093076876    0.9456457011    0.8518013605    1.1360347777    0.511960743 0.9038104168    0.5048413575    0.2777622235    0.2915840525    0.6628516415    0.4600364351    0.7996524113    0.3765721177
0.9119207879    1.2363073271    1.3285269752    1.4027039939    0.9250782309    2.1599381031    1.312307839 0   0   0.8253250513    0   0   0.8903632354

它存储在data.txt文件中。

我想要一个如下所示的PCA多重图:THIS

我在做什么:

d <- read.table("data.txt", header=TRUE, as.is=TRUE)
model <- prcomp(d, scale=TRUE)

在此之后我迷路了。

如何根据PCA投影聚类数据集并获得与上述类似的图片?

1 个答案:

答案 0 :(得分:12)

您实际上是在问两个不同的问题:

  1. 如何在PCA预测后对数据进行聚类。
  2. 如何获得上述情节。
  3. 然而,在进入那些之前我想补充一点,如果您的样本在列中,那么您没有正确地进行PCA。您应该在转置数据集上执行此操作,如下所示:

    model <- prcomp(t(d), scale=TRUE)
    

    但要实现这一点,您必须删除数据中的所有常量行。


    现在我假设您按照自己想要的方式完成了PCA步骤。

    当你指定retX = TRUE时,

    prcomp返回旋转的矩阵(默认情况下它是真的)。因此,您需要使用model$x

    您的下一步是根据主要组件对数据进行聚类。这可以通过各种方式完成。一个是层次聚类。如果你想要最后5组,这是一种方式:

    fit <- hclust(dist(model$x[,1:3]), method="complete") # 1:3 -> based on 3 components
    groups <- cutree(fit, k=5)                            # k=5 -> 5 groups
    

    此步骤将为您提供稍后用于着色的组。

    最后一步是密谋。在这里,我写了一个简单的函数,一次性完成所有操作:

    library(rgl)
    plotPCA <- function(x, nGroup) {
        n <- ncol(x) 
        if(!(n %in% c(2,3))) { # check if 2d or 3d
            stop("x must have either 2 or 3 columns")
        }
    
        fit <- hclust(dist(x), method="complete") # cluster
        groups <- cutree(fit, k=nGroup)
    
        if(n == 3) { # 3d plot
            plot3d(x, col=groups, type="s", size=1, axes=F)
            axes3d(edges=c("x--", "y--", "z"), lwd=3, axes.len=2, labels=FALSE)
            grid3d("x")
            grid3d("y")
            grid3d("z")
        } else { # 2d plot
            maxes <- apply(abs(x), 2, max)
            rangeX <- c(-maxes[1], maxes[1])
            rangeY <- c(-maxes[2], maxes[2])
            plot(x, col=groups, pch=19, xlab=colnames(x)[1], ylab=colnames(x)[2], xlim=rangeX, ylim=rangeY)
            lines(c(0,0), rangeX*2)
            lines(rangeY*2, c(0,0))
        }
    }
    

    这个函数很简单:它有两个参数:1)得分矩阵,主要成分在列中,样本在行中。如果你想要(例如)第一,第二和第四个组件,你基本上可以使用模型$ x [,c(1,2,4)]。 2)用于聚类的组数。

    然后根据传递的主成分和图(2D或3D,取决于传递的列数)对数据进行聚类

    以下是一些例子:

    plotPCA(model$x[,1:2], 5)
    

    PCA2d

    3D示例(基于3个第一主成分):

    plotPCA(model$x[,1:3], 5)
    

    PCA3D

    最后一个图将是交互式的,因此您可以将其旋转或放大/缩小。

    希望这有帮助。