Question

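In R, the limma package can give you a list of differentially expressed genes.

How can I get all the probesets with the highest signal intensity, relative to a threshold?
Can I get only the most highly expressed genes in a healthy experiment, for example from a single .CEL file? Or the most highly expressed genes across a set of .CEL files from the same group (all controls, or all samples)?

If you run the following script, everything is fine: you have many .CEL files and it all works.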

# On current Bioconductor releases, use BiocManager::install() instead of biocLite()
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GEOquery","affy","limma","gcrma"))
library(GEOquery)  # getGEOSuppFiles()
library(affy)      # ReadAffy()
library(gcrma)     # gcrma()
gse_number <- "GSE13887"
getGEOSuppFiles( gse_number )
COMPRESSED_CELS_DIRECTORY <- gse_number
untar( paste( gse_number , paste( gse_number , "RAW.tar" , sep="_") , sep="/" ), exdir=COMPRESSED_CELS_DIRECTORY)
# Match only gzipped files; "[gz]" would match any name containing "g" or "z"
cels <- list.files( COMPRESSED_CELS_DIRECTORY , pattern = "\\.gz$")
sapply( paste( COMPRESSED_CELS_DIRECTORY , cels, sep="/") , gunzip )
celData <- ReadAffy( celfile.path = gse_number )
gcrma.ExpressionSet <- gcrma(celData)

But if you manually delete all the .CEL files except one and run the script again from the start, so that the celData object contains 1 sample:

> celData
AffyBatch object
size of arrays=1164x1164 features (17 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
annotation=hgu133plus2
notes=

then you get the error:

Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'x')

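How can I get the most highly expressed genes from a single .CEL sample file?

I found a library that might be useful for this: the panp package.

However, if you run the following script: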

if(!require(panp)) { biocLite("panp") }
library(panp)
myGDS <- getGEO("GDS2697")
eset <- GDS2eSet(myGDS,do.log2=TRUE)
my_pa <- pa.calls(eset)

you get the error:

> my_pa <- pa.calls(eset)
Error in if (chip == "hgu133b") { : argument is of length zero

even though the platform of the GDS is one the library expects.

If you run pa.calls() with gcrma.ExpressionSet as the argument, everything works:

my_pa <- pa.calls(gcrma.ExpressionSet)
Processing 28 chips: ############################
Processing complete.

In summary, if you run the script, you get an error when executing:

my_pa <- pa.calls(eset)

but not when executing:

my_pa <- pa.calls(gcrma.ExpressionSet)

Why, given that both are ExpressionSet objects?

> is(gcrma.ExpressionSet)
[1] "ExpressionSet"    "eSet"             "VersionedBiobase" "Versioned"       
> is(eset)
[1] "ExpressionSet"    "eSet"             "VersionedBiobase" "Versioned"

Answer 1

Your gcrma.ExpressionSet is an object of class "ExpressionSet"; working with ExpressionSet objects is described in the Biobase vignette

vignette("ExpressionSetIntroduction")

which is also available on the Biobase landing page. In particular, the matrix of summarized expression values can be extracted with exprs(gcrma.ExpressionSet). So

> eset = gcrma.ExpressionSet  ## easier to display
> which(exprs(eset) == max(exprs(eset)), arr.ind=TRUE)
              row col
213477_x_at 22779  24
> sampleNames(eset)[24]
[1] "GSM349767.CEL"
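
The same matrix can also be ranked per sample or per group, rather than only locating the single maximum. The following is a minimal sketch in base R, assuming a small simulated matrix (expr_mat) as a stand-in for exprs(eset); the helper top_probesets(), the probeset names, the sample names, and the threshold of 8 are all hypothetical and chosen only for illustration.

```r
# Hypothetical stand-in for exprs(eset): rows are probesets, columns are samples.
set.seed(1)
expr_mat <- matrix(rnorm(20, mean = 8, sd = 2), nrow = 5,
                   dimnames = list(paste0("probe_", 1:5),
                                   paste0("GSM", 1:4, ".CEL")))

# Hypothetical helper: the n highest values for one sample, highest first.
top_probesets <- function(mat, sample, n = 3) {
  values <- mat[, sample]                  # named vector, one value per probeset
  head(sort(values, decreasing = TRUE), n)
}

# Top probesets in a single sample
top_one <- top_probesets(expr_mat, "GSM1.CEL", n = 3)

# Probesets above an (arbitrary) intensity threshold in that sample
above_threshold <- names(which(expr_mat[, "GSM1.CEL"] > 8))

# Top probesets by mean expression across a group of samples (e.g. all controls)
group_cols <- c("GSM1.CEL", "GSM2.CEL")
group_means <- rowMeans(expr_mat[, group_cols, drop = FALSE])
top_group <- head(sort(group_means, decreasing = TRUE), 3)

print(top_one)
print(top_group)
```

With a real ExpressionSet, expr_mat would be replaced by exprs(eset) and the column names by entries of sampleNames(eset). This only ranks the summarized values; it does not address the error raised when preprocessing a single .CEL file.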

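Use justGCRMA() rather than ReadAffy as a faster, more memory-efficient way to get an ExpressionSet.

Consider asking questions about Bioconductor packages on the Bioconductor support site, where you will get fast responses from knowledgeable members.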
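
On the pa.calls(eset) error from the question: the message is what R raises when the condition of an if() is zero-length. One plausible but unverified explanation is that the chip or annotation value pa.calls() compares against is empty for the object built from the GDS; comparing annotation(eset) with annotation(gcrma.ExpressionSet) may be a reasonable first check. The base-R sketch below only reproduces the message with a hypothetical empty value and does not call panp itself.

```r
# Hypothetical: an empty value where a chip name such as "hgu133b" is expected.
chip <- character(0)

# chip == "hgu133b" evaluates to logical(0), which if() cannot use as a condition.
result <- tryCatch(
  if (chip == "hgu133b") "matched" else "no match",
  error = function(e) conditionMessage(e)
)

print(result)
```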
