Question

我正在尝试使用来自Brainarray的自定义基因级注释CDF（芯片定义文件）文件，对有关弥漫性大B细胞淋巴瘤的特定基因表达数据集进行RMA标准化。

不幸的是，RMA规范化表达矩阵是所有NaN，我不明白为什么。

数据集（GSE31312）可在GEO website免费获取，并使用Affymetrix HG-U133 Plus 2.0阵列平台。我正在使用affy包来执行RMA规范化。

由于问题是针对数据集的，因此遗憾的是，以下用于重现问题的R代码非常繁琐（2 GB下载，8.8 GB解压缩）。

设置工作目录。

setwd("~/Desktop/GEO")

加载所需的包。取消注释以安装软件包。

#source("http://bioconductor.org/biocLite.R")
#biocLite(pkgs = c("GEOquery", "affy", "AnnotationDbi", "R.utils"))
library("GEOquery") # To automatically download the data
library("affy")
library("R.utils") # For file handling

将阵列数据下载到工作目录。

files <- getGEOSuppFiles("GSE31312")

解开名为CEL的目录中的数据

#Sys.setenv(TAR = '/usr/bin/tar') # For (some) OS X uncommment this line
untar(tarfile = "GSE31312/GSE31312_RAW.tar", exdir = "CEL")

解压缩所有.gz文件

gz.files <- list.files("CEL", pattern = "\\.gz$", 
                       ignore.case = TRUE, full.names = TRUE)
for (file in gz.files)
  gunzip(file, skip = TRUE, remove = TRUE)


cel.files <- list.files("CEL", pattern = "\\.cel$", 
                       ignore.case = TRUE, full.names = TRUE)

下载，安装和加载自定义Brainarray Ensembl ENSG基因注释包

download.file(paste0("http://brainarray.mbni.med.umich.edu/Brainarray/",
                     "Database/CustomCDF/18.0.0/ensg.download/",
                     "hgu133plus2hsensgcdf_18.0.0.tar.gz"),
              destfile = "hgu133plus2hsensgcdf_18.0.0.tar.gz")
install.packages("hgu133plus2hsensgcdf_18.0.0.tar.gz",
                 repos = NULL, type = "source")
library(hgu133plus2hsensgcdf)

使用和不使用自定义CDF执行RMA规范化。

affy.rma <- justRMA(filenames = cel.files, verbose = TRUE)
ensg.rma <- justRMA(filenames = cel.files, verbose = TRUE,
                    cdfname = "HGU133Plus2_Hs_ENSG")

可以看出，函数在没有警告的情况下返回表达式矩阵exprs(ensg.ram)，其中表达式矩阵中的所有条目都是NaN。

sum(is.nan(exprs(ensg.rma))) # == prod(dim(ensg.rma)) == 9964482

有趣的是，使用标准CDF时，exprs(affy.rma)中有相当多的NaN。

# Show some NaNs
na.rows <- apply(is.na(exprs(affy.rma)), 1, any)
exprs(affy.rma)[which(na.rows)[1:10], 1:4]

# A particular bad probeset:
exprs(affy.rma)["1553575_at", ]   

# There are relatively few NaNs in total (but the really should be none)
sum(is.na(exprs(affy.rma)))  # == 12305

# Probesets of with all NaNs
sum(apply(is.na(exprs(affy.rma)), 1, all))

根本不应该有NaN。我已尝试使用expresso函数仅执行背景校正（没有规范化和汇总），这也会产生NaN。因此问题似乎源于背景校正。但是，有人可能会担心它是一个或多个错误的数组。任何人都可以帮我追踪NaN的起源并获得一些有用的表达值吗？

谢谢，以及最好的问候

的Anders

编辑：单个文件似乎是问题（但不完全）

我决定检查每个.CEL文件是否独立规范化会发生什么。当justRMA给出一个我不确定的单个数组时，在RMA引擎盖下面实际发生了什么。但我想分位数归一化步骤成为同一性函数，并且摘要（中位数抛光）在第一次迭代后停止。

无论如何，为了执行逐个RMA规范化，我们运行：

ensg <- vector("list", length(cel.files))  # Initialize a list
names(ensg) <- basename(cel.files)
for (file in cel.files) {
  ensg[[basename(file)]] <- justRMA(filenames = file, verbose = TRUE,
                                    cdfname = "HGU133Plus2_Hs_ENSG")
  cat("File", which(file == cel.files), "done.\n\n")
}

# Extract the expression matrices for each file and combine them
X <- as.data.frame(do.call(cbind, lapply(ensg, exprs)))

通过查看head(X)，看来GSM776462.CEL是所有NaN。实际上，它几乎是：

sum(!is.nan(X$GSM776462.CEL)) # == 18

20009年只有18个不是NaN。接下来，我计算出现在其他地方的NaN数量

sum(is.na(X[, -grep("GSM776462.CEL", colnames(X))])) # == 0

这是0.因此，GSM776462.CEL似乎是罪魁祸首。

但是常规CDF注释没有任何问题：

affy <- justRMA(filenames = "CEL/GSM776462.CEL")
any(is.nan(exprs(affy))) # == FALSE

使用常规CDF在表达矩阵中看似随机分散的NaN也很奇怪。我仍然不知道该怎么做。

编辑2：当排除样本但不使用标准CDF

时，NaN消失

值得一提的是，当我在使用和不使用自定义CDF文件时排除GSM776462.CEL文件和RMA规范化时，NaN仅在自定义CDF情况下消失。

# Trying with all other CEL than GSM776462.CEL
cel.files2 <- grep("GSM776462.CEL", cel.files, invert = TRUE, value = TRUE)
affy.rma.no.776462 <- justRMA(filenames = cel.files2)
ensg.rma.no.776462 <- justRMA(filenames = cel.files2, verbose = TRUE,
                              cdfname = "HGU133Plus2_Hs_ENSG")

sum(is.na(exprs(affy.rma.no.776462))) # == 12275
sum(is.na(exprs(ensg.rma.no.776462))) # == 0

令人费解。

编辑3：“原始数据”中没有NA或NaN

对于它的价值，我试图读取原始探针级表达式值以检查它们是否包含NA或NaN。以下内容逐个浏览.CEL文件，并检查是否有任何缺失值。

for (file in cel.files) {
  affybtch <- suppressWarnings(read.affybatch(filenames = file))
  tmp <- exprs(affybtch)
  cat(file, "done.\n")
  if (any(is.na(tmp)))
    stop(paste("NAs or NaNs are present in", file))
}

找不到NA或NaN。

使用Brainarray定制CDF对GSE31312的RMA标准化中的所有NaN进行标准化

编辑：单个文件似乎是问题（但不完全）

编辑2：当排除样本但不使用标准CDF

编辑3：“原始数据”中没有NA或NaN

0 个答案: