R: Running row-wise operations between data frames

Date: 2015-02-02 19:23:53

Tags: r for-loop

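I want to run a row-matched statistical test between two data frames, gex and mxy. The catch is that I need to run it several times, each time with a different column of gex, and each run should produce its own vector of test results.

With help from @kristang, this is what I have so far (using sample values).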

gex <- data.frame("sample" =  c(987,7829,15056,15058,15072), 
                  "TCGA-F4-6703-01" = runif(5, -1, 1),
                  "TCGA-DM-A28E-01" = runif(5, -1, 1),
                  "TCGA-AY-6197-01" = runif(5, -1, 1),
                  "TCGA-A6-5657-01" = runif(5, -1, 1))
colnames(gex) <- gsub("[.]", "_",colnames(gex))

listx <- c("TCGA_DM_A28E_01","TCGA_A6_5657_01")

mxy <- data.frame("TCGA-AD-6963-01" = runif(5, -1, 1),
                  "TCGA-AA-3663-11" = runif(5, -1, 1),
                  "TCGA-AD-6901-01" = runif(5, -1, 1),
                  "TCGA-AZ-2511-01" = runif(5, -1, 1),
                  "TCGA-A6-A567-01" = runif(5, -1, 1)) 

colnames(mxy) <- gsub("[.]", "_",colnames(mxy))

zScore <- function(x,y)((as.numeric(x) - as.numeric(rowMeans(y,na.rm=T)))/as.numeric(sd(y,na.rm=T)))

## BELOW IS FOR DIAGNOSTICS

write.table(mxy, file = "mxy.csv", 
            row.names=FALSE, col.names=TRUE, sep=",", quote=F)

write.table(gex, file = "gex.csv", 
            row.names=FALSE, col.names=TRUE, sep=",", quote=F)

## ABOVE IS FOR DIAGNOSTICS

for(i in seq(nrow(mxy)))
  for(colName in listx){

    zvalues <- zScore(gex[,colName[colName %in% names(gex)]],
                      mxy[i,])

    ## BELOW IS FOR DIAGNOSTICS

    write.table(gex[,colName[colName %in% names(gex)]], file=paste0(colName, "column", ".csv"),
                row.names=FALSE,col.names=FALSE,sep=",",quote=F)

    write.table(mxy[i,], file=paste0(colName, "mxyinput", ".csv"),
                row.names=FALSE,col.names=FALSE,sep=",",quote=F)

    ## ABOVE IS FOR DIAGNOSTICS

    geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL, 
                              stringsAsFactors = FALSE)
    write.csv(geneexptest, file = paste0(colName, ".csv"), 
              row.names=FALSE, col.names=FALSE, sep=",", quote=F)
  }

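The problem is that although it seems to run through and create the right number of output files, each with the right number of rows, it does not produce the correct z-scores. I want it to compute:

((value from row z of the given gex column) - (mean of the values in row z across mxy)) / (standard deviation of the values in row z across mxy)

Then it should move on to the next row, and so on, filling the first vector. After that, I want it to compute the same thing using the next column of gex, filling a separate vector. I hope that makes sense.

I have a separate script that runs the same test against other data frames using a predetermined column. The relevant for loop from that script looks like this: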
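
To spell out the arithmetic for a single row, here is a minimal sketch with made-up numbers (mxy_row and gex_value are illustrative names and values, not taken from my data):

```r
# Hypothetical values for one row: five mxy measurements and one gex value.
mxy_row <- c(1, 2, 3, 4, 5)
gex_value <- 4

row_mean <- mean(mxy_row)  # mean of the row across mxy: 3
row_sd <- sd(mxy_row)      # sample standard deviation of the row: sqrt(2.5)

# z-score of the gex value against that row of mxy.
z <- (gex_value - row_mean) / row_sd
```

Each row should yield exactly one such value, so one gex column gives a vector with as many entries as there are rows.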

for(i in seq_along(mxy)){
  zvalues[i] <- (gex_column_W[i] - mean(mxy[i,])) / sd(mxy[i,])
}

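1 Answer:

Answer 0: (score: 0)

I think there may be a typo in your code: you say you want the mean of the values in row z of "mxy", but you are using mean(mxy[,i]), which selects the i-th column rather than the i-th row. For clarity, I rewrote this part with a for loop. (Not sure why lapply was used?)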

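# a function for calculating the z-score of x against the values in y
# unlist() flattens a one-row data frame such as mxy[i, ] into a numeric vector,
# which is what mean() and sd() expect
zScore <- function(x, y) {
    y <- unlist(y)
    (x - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)
}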

for(i in seq(nrow(mxy))) # note that length(mxy) is actually the number of columns in mxy
for(colName in listx){
    zvalues <- zScore(gex[,colName],# column == colName
                      mxy[i,])# row == i
    geneexptest <- data.frame(gex$sample, zvalues, row.names = NULL, 
                          stringsAsFactors = FALSE)
    write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
                row.names=FALSE, col.names=FALSE,  quote=F,
                sep = ",", dec = ".", append=(i > 1))

}
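
To make the row/column distinction concrete, here is a small sketch on a toy data frame (the name toy and its values are made up for illustration). It also shows why a single row is flattened with unlist() before averaging: toy[2, ] is still a one-row data frame, not a numeric vector.

```r
# Toy data frame: 2 rows, 3 columns (hypothetical values).
toy <- data.frame(a = c(1, 2), b = c(3, 4), c = c(5, 6))

n_cols <- length(toy)  # length() of a data frame counts its columns
n_rows <- nrow(toy)    # nrow() counts its rows

col_2 <- toy[, 2]          # second column, already a plain numeric vector
row_2 <- unlist(toy[2, ])  # second row, flattened to a numeric vector

col_mean <- mean(col_2)  # mean down column b
row_mean <- mean(row_2)  # mean across row 2
```

Looping over seq(nrow(mxy)) therefore walks the rows, while seq_along(mxy) or seq(length(mxy)) would walk the columns.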

And an alternative that does not rely on append:

for(colName in listx){
    geneexptest <- NULL
    for(i in seq(nrow(mxy))) {
        zvalues <- zScore(gex[,colName],# column == colName
                          mxy[i,])# row == i
        geneexptest <- rbind(geneexptest,
                            data.frame(gex$sample, zvalues, row.names = NULL, 
                              stringsAsFactors = FALSE))
    }
    # the file is written once per column, after all rows are collected,
    # so no append argument is needed
    write.table(geneexptest, file = paste0(colName, "mxyinput", ".csv"),
                row.names=FALSE, col.names=FALSE,  quote=F,
                sep = ",", dec = ".")
}
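
If the goal is exactly the formula in the question (row i of the gex column against row i of mxy, giving one z-score per row), a vectorized version may be simpler than nested loops. This is only a minimal sketch under that assumption; gex_demo, mxy_demo and cols_demo are made-up stand-ins, not objects from the question:

```r
# Made-up stand-ins for the question's data (three rows each).
gex_demo <- data.frame(sample = c(987, 7829, 15056),
                       col_A = c(0.5, -0.2, 0.9),
                       col_B = c(-0.7, 0.1, 0.3))
mxy_demo <- data.frame(s1 = c(0.1, 0.4, -0.3),
                       s2 = c(0.3, -0.6, 0.2),
                       s3 = c(-0.1, 0.5, 0.7))
cols_demo <- c("col_A", "col_B")

# Per-row mean and standard deviation across the columns of mxy_demo.
m <- as.matrix(mxy_demo)
row_mean <- rowMeans(m, na.rm = TRUE)
row_sd <- apply(m, 1, sd, na.rm = TRUE)

# One z-score vector per selected column; row i uses only row i of mxy_demo.
z_by_col <- lapply(gex_demo[cols_demo], function(x) (x - row_mean) / row_sd)

# One data frame per column, pairing each sample id with its z-score.
results <- lapply(z_by_col, function(z) data.frame(sample = gex_demo$sample, z = z))
```

Each element of results has one row per sample and could then be written out with write.table(), for example under a file name built from names(results).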