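I am a biologist. The output of my experiment contains a large number of features, stored as columns, with 563 rows. The number of columns, 8603, is quite high.
So when I try to run a PCA in R, I get an "out of memory" error.
I have also tried running princomp on the data in pieces, but that does not seem to work for our approach.
I tried using the script provided in the link...
but it still did not work :(
I am trying the following code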
bumpus <- read.table("http://www.ndsu.nodak.edu/ndsu/doetkott/introsas/rawdata/bumpus.html",
skip=20, nrows=49,
col.names=c("id","total","alar","head","humerus","sternum"))
boxplot(bumpus, main="Boxplot of Bumpus' data") ## in this step it is showing the ERROR
# we first standardize the data:
bumpus.scaled <- data.frame( apply(bumpus,2,scale) )
boxplot(bumpus.scaled, main="Boxplot of standardized Bumpus' data")
pca.res <- prcomp(bumpus.scaled, retx=TRUE)
pca.res
# note:
# PC.1 is some kind of average of all the measurements
# => measure of size of the bird
# PC.2 has a negative weight for 'sternum'
# and positive weights for 'alar', 'head' and 'humerus'
# => measure of shape of the bird
# first two principal components:
pca.res$x[,1:2]
plot(pca.res$x[,1:2], pch="", main="PC.1 and PC.2 for Bumpus' data (blue=survived, red=died)")
text(pca.res$x[,1:2], labels=c(1:49), col=c(rep("blue",21),rep("red",28)))
abline(v=0, lty=2)
abline(h=0, lty=2)
# compare to segment plot:
dev.new()  # portable replacement for windows(), which exists only on Windows
palette(rainbow(12, s = 0.6, v = 0.75))
stars(bumpus, labels=c(1:49), nrow=6, key.loc=c(20,-1),
main="Segment plot of Bumpus' data", draw.segments=TRUE)
# compare to biplot:
dev.new()  # portable replacement for windows(), which exists only on Windows
biplot(pca.res, scale=0)
# what do the arrows mean?
# consider the arrow for sternum:
abline(0, pca.res$rotation[5,2]/pca.res$rotation[5,1])
# consider the arrow for head:
abline(0, pca.res$rotation[3,2]/pca.res$rotation[3,1])
But the second line
boxplot(bumpus, main="Boxplot of Bumpus' data") ## shows the error
The error is
Error: cannot allocate vector of size 1.4 Mb
In addition: There were 27 warnings (use warnings() to see them)
Please help!
Answer 0 (score: 3)
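If the number of features is huge or exceeds the number of observations, it is advisable to compute the principal components from the transposed dataset. This is especially true in your case, because the default approach implies computing an 8603 x 8603 covariance matrix, which by itself already consumes more than 500 MB of memory (about 590 MB as 8-byte doubles; not an enormous amount, but still...).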
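As a rough check of that memory figure, the arithmetic can be sketched in R. The sketch assumes the matrices are stored as dense 8-byte doubles (R's default numeric type); the variable names are illustrative only, and the sizes are the ones stated in the question.

```r
# Back-of-the-envelope memory estimate, assuming dense matrices of 8-byte doubles.
n_obs  <- 563    # rows (observations), from the question
n_feat <- 8603   # columns (features), from the question

bytes_per_double <- 8
cov_bytes  <- n_feat^2 * bytes_per_double  # features x features covariance matrix
gram_bytes <- n_obs^2  * bytes_per_double  # observations x observations matrix

cat(sprintf("features x features matrix:         %.0f MB\n", cov_bytes  / 1e6))
cat(sprintf("observations x observations matrix: %.1f MB\n", gram_bytes / 1e6))
```

Under these assumptions the feature-side matrix needs about 592 MB, while the observation-side matrix needs about 2.5 MB, which is why working on the transposed problem is attractive here.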
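Assuming that the rows of your matrix X correspond to observations and the columns correspond to features, center your data and then perform PCA on the transpose of the centered X. There will not be more eigenpairs than observations anyway. Finally, multiply each resulting eigenvector by X^T. You do not need to do the latter for the eigenvalues (see the detailed explanation further below).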
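This code demonstrates an implementation of PCA on the transposed dataset and compares the results of prcomp with those of the "transposed PCA":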
pca.reduced <- function(X, center=TRUE, retX=TRUE) {
# Note that the data must first be centered on the *original* dimensions
# because the centering of the 'transposed covariance' is meaningless for
# the dataset. This is also why Sigma must be computed dependent on N
# instead of simply using cov().
if (center) {
mu <- colMeans(X)
X <- sweep(X, 2, mu, `-`)
}
# From now on we're looking at the transpose of X:
Xt <- t(X)
aux <- svd(Xt)
V <- Xt %*% aux$v
# Normalize the columns of V.
V <- apply(V, 2, function(x) x / sqrt(sum(x^2)))
# Done.
list(X = if (retX) X %*% V else NULL,
V = V,
sd = aux$d / sqrt(nrow(X)-1),
mean = if (center) mu else NULL)
}
# Example data (low-dimensional, but sufficient for this example):
X <- cbind(rnorm(1000), rnorm(1000) * 5, rnorm(1000) * 3)
original <- prcomp(X, scale=FALSE)
transposed <- pca.reduced(X)
# See what happens:
> print(original$sdev)
[1] 4.6468136 2.9240382 0.9681769
> print(transposed$sd)
[1] 4.6468136 2.9240382 0.9681769
>
> print(original$rotation)
PC1 PC2 PC3
[1,] -0.0055505001 0.0067322416 0.999961934
[2,] -0.9999845292 -0.0004024287 -0.005547916
[3,] 0.0003650635 -0.9999772572 0.006734371
> print(transposed$V)
[,1] [,2] [,3]
[1,] 0.0055505001 0.0067322416 -0.999961934
[2,] 0.9999845292 -0.0004024287 0.005547916
[3,] -0.0003650635 -0.9999772572 -0.006734371
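The comparison above uses a tall matrix (1000 x 3). For the case in the question, with more features than observations, a minimal sketch might look as follows. It is not the code from the answer: the sizes, the seed and the tolerance are arbitrary assumptions, and it reads the principal axes directly from the left singular vectors of the transposed data. Because centering removes one degree of freedom, at most n - 1 components have non-zero variance, so the remaining ones are dropped.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 40; p <- 300                 # hypothetical sizes with more features than rows
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Center each feature (column) in the original layout.
Xc <- sweep(X, 2, colMeans(X), `-`)

# SVD of the transposed, centered data: t(Xc) = U D V^T.
# The columns of U are the principal axes (loadings).
s <- svd(t(Xc))

# Keep only components with a non-negligible singular value
# (the relative tolerance is an assumption, not a recommendation).
keep     <- s$d > max(s$d) * 1e-8
k        <- sum(keep)
loadings <- s$u[, keep, drop = FALSE]
sdev     <- s$d[keep] / sqrt(n - 1)
scores   <- Xc %*% loadings

# Compare with prcomp(); loadings may differ by sign, so compare absolute values.
ref <- prcomp(X, center = TRUE, scale. = FALSE)
print(k)
print(max(abs(sdev - ref$sdev[1:k])))
print(max(abs(abs(loadings) - abs(ref$rotation[, 1:k]))))
```

Both printed differences should be close to machine precision, and k should equal n - 1 for data like this.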
To see why it is possible to work on the transposed matrix, consider the following:
The general form of the eigenvalue equation is
A x = λ x (1)
Without loss of generality, let M be a centered "copy" of the original dataset X. Substituting M^T M for A yields
M^T M x = λ x (2)
Multiplying this equation by M from the left yields
M M^T M x = λ M x (3)
Subsequently substituting y = M x yields
M M^T y = λ y (4)
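One can already see that y corresponds to an eigenvector of the "covariance" matrix of the transposed dataset. (Note that M M^T is not a real covariance matrix, because the dataset X was centered along its columns rather than its rows. Furthermore, scaling must be done by the number of samples (the rows of M) rather than the number of features (the columns of M, i.e. the rows of M^T).)
One can also see that the non-zero eigenvalues of M M^T and M^T M are identical.
Finally, one last multiplication by M^T from the left yields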
(M^T M) M^T y = λ M^T y (5)
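where M^T M is the original covariance matrix (up to scaling by the number of samples).
From equation (5) it follows that M^T y is an eigenvector of M^T M with eigenvalue λ.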
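The derivation can be checked numerically. The following sketch uses random data with hypothetical sizes, seed and tolerance. It solves the small eigenproblem for M M^T, maps each eigenvector y back with M^T, and verifies the relation in equation (5) without ever forming the large matrix M^T M.

```r
set.seed(2)                            # arbitrary seed, for reproducibility only
n <- 30; p <- 200                      # hypothetical sizes with p > n
X <- matrix(rnorm(n * p), n, p)
M <- sweep(X, 2, colMeans(X), `-`)     # centered copy of X, as in the derivation

# Small eigenproblem: the n x n matrix M M^T instead of the p x p matrix M^T M.
G  <- tcrossprod(M)                    # same as M %*% t(M)
eg <- eigen(G, symmetric = TRUE)

# Drop numerically zero eigenvalues (the relative tolerance is an assumption).
keep   <- eg$values > max(eg$values) * 1e-10
lambda <- eg$values[keep]
Y      <- eg$vectors[, keep, drop = FALSE]

# Equation (5): M^T y is an eigenvector of M^T M with the same eigenvalue.
W <- crossprod(M, Y)                          # same as t(M) %*% Y
W <- sweep(W, 2, sqrt(colSums(W^2)), `/`)     # normalize each column to unit length

# Residual of (M^T M) w = lambda w, computed as M^T (M w) to avoid the p x p matrix.
resid <- crossprod(M, M %*% W) - sweep(W, 2, lambda, `*`)
print(max(abs(resid)))

# The eigenvalues, scaled by n - 1, reproduce the variances reported by prcomp().
ref <- prcomp(X)
print(max(abs(sqrt(lambda / (n - 1)) - ref$sdev[seq_along(lambda)])))
```

Both printed values should be close to machine precision. Only the n x n matrix G and the p x k matrix W are held in memory, which is the point of the transposed approach.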