Looping through columns in R

Asked: 2015-06-30 04:39:38

Tags: r

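I have this data frame result (provided via dput below) with columns A, C, G, T, N and X. The same letters A, C, G, T, N and X also appear as values in columns ALT1 through ALTn. In this particular example the ALT columns range from ALT1 to ALT2, but in other cases they can go up to ALTn. Here is what I want to do: for each of ALT1, ALT2, ..., ALTn, match the letter in that column to the column of the same name among A, C, G, T, N and X, extract the corresponding number from that column, and paste it together with the letter in the REF column and the number from the column that REF names. I have written a for loop (see "My loop") that does this for the ALT1 column only and stores the result in final1. I want to turn this loop into a function so that it extends to all ALT columns and eventually gives results in the form final1, final2, final3, ..., finaln (see the expected result for final1). I then want to paste final1 through finaln together. How can this be done in R?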

dput(result):

structure(list(start = c("chr1:101544447", "chr1:102053031", 
"chr1:102778767", "chr1:102789831", "chr1:102989480", "chr1:103310574", 
"chr1:103870326"), A = c(NA, NA, NA, NA, NA, 2L, NA), C = c(NA, 
34L, 24L, NA, NA, 22L, 12L), G = c(NA, NA, NA, NA, NA, NA, NA
), T = c(53L, NA, NA, 30L, 12L, NA, NA), N = c(NA, NA, NA, NA, 
NA, NA, NA), X. = c(NA, NA, NA, NA, NA, NA, NA), X..1 = c(NA, 
NA, NA, NA, NA, NA, NA), end = c(101544447L, 102053031L, 102778767L, 
102789831L, 102989480L, 103310574L, 103870326L), REF = c("A", 
"C", "C", "C", "C", "C", "C"), ALT = c("T", "G", "T", "T", "T", 
"A", "A"), ALT1 = c("T", "G", "T", "T", "T", "A", "A"), ALT2 = c(NA, 
NA, NA, NA, NA, NA, NA), TYPE = c("snp", "snp", "snp", "snp", 
"snp", "snp", "snp")), .Names = c("start", "A", "C", "G", "T", 
"N", "X.", "X..1", "end", "REF", "ALT", "ALT1", "ALT2", "TYPE"
), class = "data.frame", row.names = c(NA, -7L))
  

My loop:

final1 <- character(nrow(result))

for (i in seq_len(nrow(result))) {
  final1[i] <- paste(paste(result$chr[i], result$start[i], result$end[i], sep=":"), "-",
                     result$REF[i], "(", result[, as.character(result$REF[i])][i], ")", ",", result$ALT1[i],
                     "(", result[, as.character(result$ALT1[i])][i][!is.na(result[, as.character(result$ALT1[i])][i])], ")", sep="")
}

final1

Expected output (for the ALT1 column only):

> final1
[1] ":chr1:101544447:101544447-A(NA),T(53)" ":chr1:102053031:102053031-C(34),G()"   ":chr1:102778767:102778767-C(24),T()"  
[4] ":chr1:102789831:102789831-C(NA),T(30)" ":chr1:102989480:102989480-C(NA),T(12)" ":chr1:103310574:103310574-C(22),A(2)" 
[7] ":chr1:103870326:103870326-C(12),A()"  

1 answer:

Answer 0 (score: 1)

Starting with your data, I modified it slightly to provide an ALT2 column that is actually usable (i.e., not all NAs):

## result, as defined above
set.seed(42)
result$ALT2 <- sample(c('A','C','G','T'), size=nrow(result), replace=TRUE)
result
##            start  A  C  G  T  N X. X..1       end REF ALT ALT1 ALT2 TYPE
## 1 chr1:101544447 NA NA NA 53 NA NA   NA 101544447   A   T    T    T  snp
## 2 chr1:102053031 NA 34 NA NA NA NA   NA 102053031   C   G    G    T  snp
## 3 chr1:102778767 NA 24 NA NA NA NA   NA 102778767   C   T    T    C  snp
## 4 chr1:102789831 NA NA NA 30 NA NA   NA 102789831   C   T    T    T  snp
## 5 chr1:102989480 NA NA NA 12 NA NA   NA 102989480   C   T    T    G  snp
## 6 chr1:103310574  2 22 NA NA NA NA   NA 103310574   C   A    A    G  snp
## 7 chr1:103870326 NA 12 NA NA NA NA   NA 103870326   C   A    A    G  snp

From here, I first find all of the ALT columns of interest:

alts <- colnames(result)[grepl('ALT[0-9]+', colnames(result))]
alts
## [1] "ALT1" "ALT2"
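
To see what the pattern selects, here is a small check against the column names from the question. Because `[0-9]+` requires at least one digit after `ALT`, the plain `ALT` column is left out:

```r
# Column names copied from the dput() output in the question
cols <- c("start", "A", "C", "G", "T", "N", "X.", "X..1", "end",
          "REF", "ALT", "ALT1", "ALT2", "TYPE")

# grepl() is TRUE for names containing "ALT" followed by one or more digits
alts <- cols[grepl('ALT[0-9]+', cols)]
alts
## [1] "ALT1" "ALT2"
```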

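Next comes code that handles all of the ALT columns in one pass, however many there are. It is more verbose than strictly necessary, but that helps show how the result breaks down into components.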

ret <- t(sapply(1:nrow(result), function(r) {
    dat <- result[r,]
    part1 <- paste(c('', dat[,c('start','end')]), collapse=':')
    part2 <- sprintf('%s(%s)', dat$REF, dat[ dat$REF ])
    part3 <- sapply(alts, function(alt) sprintf('%s(%s)', dat[[alt]], dat[ dat[[alt]] ]) )
    part23 <- paste(part2, part3, sep=',')
    paste(part1, part23, sep='-')
}))
colnames(ret) <- alts
ret
##      ALT1                                    ALT2                                   
## [1,] ":chr1:101544447:101544447-A(NA),T(53)" ":chr1:101544447:101544447-A(NA),T(53)"
## [2,] ":chr1:102053031:102053031-C(34),G(NA)" ":chr1:102053031:102053031-C(34),T(NA)"
## [3,] ":chr1:102778767:102778767-C(24),T(NA)" ":chr1:102778767:102778767-C(24),C(24)"
## [4,] ":chr1:102789831:102789831-C(NA),T(30)" ":chr1:102789831:102789831-C(NA),T(30)"
## [5,] ":chr1:102989480:102989480-C(NA),T(12)" ":chr1:102989480:102989480-C(NA),G(NA)"
## [6,] ":chr1:103310574:103310574-C(22),A(2)"  ":chr1:103310574:103310574-C(22),G(NA)"
## [7,] ":chr1:103870326:103870326-C(12),A(NA)" ":chr1:103870326:103870326-C(12),G(NA)"

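Explanation:

  • The outer t(sapply(...)) iterates over each row of the data.frame. The t(...) is needed because otherwise the result is rotated relative to what you expect.
  • part1 and part2 each create a single string for the row. Every ALT string built for that row shares these components, so they only need to be created once.
  • part3 creates as many strings as there are ALTs.
  • part23 merely combines part2 with each element of part3.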

It may be informative to set r <- 1 at the console and step through this manually, inspecting each variable as it is created.
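
A minimal sketch of such a walk-through. It assumes a cut-down three-row copy of the data (rows 1, 2 and 6, keeping only the columns the code touches, with the sampled ALT2 values shown earlier), and it uses `[[` for the column lookups so that each looked-up value is a plain vector:

```r
# Cut-down copy of the data: an assumption made for illustration only
result <- data.frame(
  start = c("chr1:101544447", "chr1:102053031", "chr1:103310574"),
  A = c(NA, NA, 2L), C = c(NA, 34L, 22L),
  G = c(NA_integer_, NA, NA), T = c(53L, NA, NA),
  end = c(101544447L, 102053031L, 103310574L),
  REF = c("A", "C", "C"), ALT1 = c("T", "G", "A"), ALT2 = c("T", "T", "G"),
  stringsAsFactors = FALSE
)
alts <- colnames(result)[grepl('ALT[0-9]+', colnames(result))]

r <- 1
dat <- result[r, ]                      # the single row being processed

# Position prefix; the empty first element produces the leading ':'
part1 <- paste(c('', dat$start, dat$end), collapse = ':')
part1
## [1] ":chr1:101544447:101544447"

# REF letter plus the count stored in the column named after that letter
part2 <- sprintf('%s(%s)', dat$REF, dat[[dat$REF]])
part2
## [1] "A(NA)"

# One "letter(count)" string per ALT column
part3 <- sapply(alts, function(alt) sprintf('%s(%s)', dat[[alt]], dat[[dat[[alt]]]]))
unname(part3)
## [1] "T(53)" "T(53)"

# part2 is recycled against every element of part3
part23 <- paste(part2, part3, sep = ',')
paste(part1, part23, sep = '-')
## [1] ":chr1:101544447:101544447-A(NA),T(53)" ":chr1:101544447:101544447-A(NA),T(53)"
```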

Finally, you said you wish (for some reason that eludes me) to combine the strings for all ALTs into a single string per row. You can do that with:

apply(ret, 1, paste, collapse='')
## [1] ":chr1:101544447:101544447-A(NA),T(53):chr1:101544447:101544447-A(NA),T(53)"
## [2] ":chr1:102053031:102053031-C(34),G(NA):chr1:102053031:102053031-C(34),T(NA)"
## [3] ":chr1:102778767:102778767-C(24),T(NA):chr1:102778767:102778767-C(24),C(24)"
## [4] ":chr1:102789831:102789831-C(NA),T(30):chr1:102789831:102789831-C(NA),T(30)"
## [5] ":chr1:102989480:102989480-C(NA),T(12):chr1:102989480:102989480-C(NA),G(NA)"
## [6] ":chr1:103310574:103310574-C(22),A(2):chr1:103310574:103310574-C(22),G(NA)" 
## [7] ":chr1:103870326:103870326-C(12),A(NA):chr1:103870326:103870326-C(12),G(NA)"

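BTW: I'm not sure why you put a leading colon before the first chr1:.... If it is in anticipation of the final merge, that can be done better by changing one line of the sapply code to:
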
part1 <- paste(dat[,c('start','end')], collapse=':')

and the last line to:

apply(ret, 1, paste, collapse=':')

But perhaps you have reasons that are not clear to me.

Cheers!

Edit: wrapping this in a function should be trivial:

func <- function(result) {
    alts <- ...
    ret <- t(sapply(...
    colnames(ret) <- alts
    apply(ret, 1, paste, collapse='')
}
func(result)
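
One way the elided parts might be filled in, reusing the code from earlier in this answer. This is only a sketch under stated assumptions: the input has at least two ALT columns with no NA entries, the three-row data frame is a cut-down copy made for illustration, and the lookups use `[[` so that each value is a plain vector.

```r
func <- function(result) {
    # All columns named ALT followed by digits
    alts <- colnames(result)[grepl('ALT[0-9]+', colnames(result))]
    # One row of strings per input row, one column per ALT
    ret <- t(sapply(seq_len(nrow(result)), function(r) {
        dat <- result[r, ]
        part1 <- paste(c('', dat$start, dat$end), collapse = ':')
        part2 <- sprintf('%s(%s)', dat$REF, dat[[dat$REF]])
        part3 <- sapply(alts, function(alt) sprintf('%s(%s)', dat[[alt]], dat[[dat[[alt]]]]))
        paste(part1, paste(part2, part3, sep = ','), sep = '-')
    }))
    colnames(ret) <- alts
    # Collapse the per-ALT strings of each row into a single string
    apply(ret, 1, paste, collapse = '')
}

# Cut-down copy of the data (rows 1, 2 and 6): an assumption for illustration
result <- data.frame(
  start = c("chr1:101544447", "chr1:102053031", "chr1:103310574"),
  A = c(NA, NA, 2L), C = c(NA, 34L, 22L),
  G = c(NA_integer_, NA, NA), T = c(53L, NA, NA),
  end = c(101544447L, 102053031L, 103310574L),
  REF = c("A", "C", "C"), ALT1 = c("T", "G", "A"), ALT2 = c("T", "T", "G"),
  stringsAsFactors = FALSE
)
out <- func(result)
out[3]
## [1] ":chr1:103310574:103310574-C(22),A(2):chr1:103310574:103310574-C(22),G(NA)"
```

With a single ALT column the inner sapply() would return one string per row, so the t() and colnames() steps would need adjusting; that case is not handled here.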

Edit #2: as the list of requirements grows, I feel like I'm in a spiral-development government contract ;-)

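Going back to the unmodified data (with the original, all-NA ALT2), I changed a single ALT2 entry so that I could test whether this code does what I infer the intent to be: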

result$ALT2[5] <- 'A'

...and now the modified code, all in one:

ret <- sapply(1:nrow(result), function(r) {
    dat <- result[r,]
    part1 <- paste(c('', dat[,c('start','end')]), collapse=':')
    part2 <- sprintf('%s(%s)', dat$REF, dat[ dat$REF ])
    part3 <- unlist(sapply(alts, function(alt) {
        if (is.na(dat[[alt]])) NULL
        else sprintf('%s(%s)', dat[[alt]], dat[ dat[[alt]] ])
    }))
    part23 <- paste(part2, part3, sep=',')
    part123 <- paste(part1, part23, sep='-', collapse='')
})
ret
## [1] ":chr1:101544447:101544447-A(NA),T(53)"                                     
## [2] ":chr1:102053031:102053031-C(34),G(NA)"                                     
## [3] ":chr1:102778767:102778767-C(24),T(NA)"                                     
## [4] ":chr1:102789831:102789831-C(NA),T(30)"                                     
## [5] ":chr1:102989480:102989480-C(NA),T(12):chr1:102989480:102989480-C(NA),A(NA)"
## [6] ":chr1:103310574:103310574-C(22),A(2)"                                      
## [7] ":chr1:103870326:103870326-C(12),A(NA)"
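
The NA handling in this version can be checked on a single row in isolation. The sketch below assumes a one-row data frame modeled on row 1 of the question's data (ALT1 is "T", ALT2 is NA) and uses `[[` lookups. Returning NULL for an NA ALT means unlist() simply drops that entry, so only the non-NA ALTs contribute a string:

```r
# One-row stand-in for result[1, ]: an assumption made for illustration
dat <- data.frame(
  start = "chr1:101544447", A = NA_integer_, T = 53L,
  end = 101544447L, REF = "A", ALT1 = "T", ALT2 = NA_character_,
  stringsAsFactors = FALSE
)
alts <- c("ALT1", "ALT2")

# NULL entries vanish when the list is flattened by unlist()
part3 <- unlist(lapply(alts, function(alt) {
    if (is.na(dat[[alt]])) NULL
    else sprintf('%s(%s)', dat[[alt]], dat[[dat[[alt]]]])
}))
part3
## [1] "T(53)"

part1 <- paste(c('', dat$start, dat$end), collapse = ':')
part2 <- sprintf('%s(%s)', dat$REF, dat[[dat$REF]])
row_string <- paste(part1, paste(part2, part3, sep = ','), sep = '-', collapse = '')
row_string
## [1] ":chr1:101544447:101544447-A(NA),T(53)"
```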