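I have this data frame result (provided via dput below) with columns A, C, G, T, N and X. The same letters A, C, G, T, N and X also appear as values in columns ALT1 through ALTn. In this particular example the ALT columns run from ALT1 to ALT2, but in other cases there can be up to ALTn. Here is what I want to do: match the letters in ALT1, ALT2..ALTn against the columns named A, C, G, T, N and X (basically the column names), extract the corresponding numeric values from those columns, and paste them together with the letter in the REF column and its corresponding numeric value from columns A, C, G, T, N and X. I have written this for loop (My loop, below), which does the job only for the ALT1 column and stores the result in final1. I want to turn this loop into a function so that it extends to all ALT columns and gives results of the form final1, final2, final3..finaln (see the expected output for final1). Then I want to paste final1 through finaln together. How can this be done in R?
dput(result):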
structure(list(start = c("chr1:101544447", "chr1:102053031",
"chr1:102778767", "chr1:102789831", "chr1:102989480", "chr1:103310574",
"chr1:103870326"), A = c(NA, NA, NA, NA, NA, 2L, NA), C = c(NA,
34L, 24L, NA, NA, 22L, 12L), G = c(NA, NA, NA, NA, NA, NA, NA
), T = c(53L, NA, NA, 30L, 12L, NA, NA), N = c(NA, NA, NA, NA,
NA, NA, NA), X. = c(NA, NA, NA, NA, NA, NA, NA), X..1 = c(NA,
NA, NA, NA, NA, NA, NA), end = c(101544447L, 102053031L, 102778767L,
102789831L, 102989480L, 103310574L, 103870326L), REF = c("A",
"C", "C", "C", "C", "C", "C"), ALT = c("T", "G", "T", "T", "T",
"A", "A"), ALT1 = c("T", "G", "T", "T", "T", "A", "A"), ALT2 = c(NA,
NA, NA, NA, NA, NA, NA), TYPE = c("snp", "snp", "snp", "snp",
"snp", "snp", "snp")), .Names = c("start", "A", "C", "G", "T",
"N", "X.", "X..1", "end", "REF", "ALT", "ALT1", "ALT2", "TYPE"
), class = "data.frame", row.names = c(NA, -7L))
My loop:
final1 <- {}
i <- 1
for(i in 1:nrow(result)){
  # note: result has no chr column, so result$chr[i] is NULL here
  final1[i] = paste(paste(result$chr[i], result$start[i], result$end[i],sep=":"),"-",
    result$REF[i],"(",result[,(as.character(result$REF[i]))][i],")",",", result$ALT1[i],
    "(",result[,(as.character(result$ALT1[i]))][i][!is.na(result[,(as.character(result$ALT1[i]))][i])],")",sep="")
}
final1
Expected output (for the ALT1 column only):
> final1
[1] ":chr1:101544447:101544447-A(NA),T(53)" ":chr1:102053031:102053031-C(34),G()" ":chr1:102778767:102778767-C(24),T()"
[4] ":chr1:102789831:102789831-C(NA),T(30)" ":chr1:102989480:102989480-C(NA),T(12)" ":chr1:103310574:103310574-C(22),A(2)"
[7] ":chr1:103870326:103870326-C(12),A()"
Answer 0 (score: 1)
Starting from your data, I modified it slightly to provide an ALT2 column that is actually usable (i.e., not all NAs):
## result, as defined above
set.seed(42)
result$ALT2 <- sample(c('A','C','G','T'), size=nrow(result), replace=TRUE)
result
##            start  A  C  G  T  N X. X..1       end REF ALT ALT1 ALT2 TYPE
## 1 chr1:101544447 NA NA NA 53 NA NA   NA 101544447   A   T    T    T  snp
## 2 chr1:102053031 NA 34 NA NA NA NA   NA 102053031   C   G    G    T  snp
## 3 chr1:102778767 NA 24 NA NA NA NA   NA 102778767   C   T    T    C  snp
## 4 chr1:102789831 NA NA NA 30 NA NA   NA 102789831   C   T    T    T  snp
## 5 chr1:102989480 NA NA NA 12 NA NA   NA 102989480   C   T    T    G  snp
## 6 chr1:103310574  2 22 NA NA NA NA   NA 103310574   C   A    A    G  snp
## 7 chr1:103870326 NA 12 NA NA NA NA   NA 103870326   C   A    A    G  snp
From here, I first find all of the ALT columns of interest:
alts <- colnames(result)[grepl('ALT[0-9]+', colnames(result))]
alts
## [1] "ALT1" "ALT2"
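Note that the pattern above is unanchored, so it matches any column name that merely contains ALT followed by a digit. Here is a minimal sketch of a stricter variant, assuming the ALT columns are named exactly ALT1, ALT2, and so on. The column names in `cols` are made up for illustration and are not from the question's data.

```r
# Hypothetical column names, for illustration only.
cols <- c("start", "REF", "ALT", "ALT1", "ALT2", "ALT10", "oldALT3x")

# Unanchored pattern: matches anywhere inside the name.
loose <- cols[grepl("ALT[0-9]+", cols)]

# Anchored pattern: the whole name must be "ALT" followed by digits.
strict <- grep("^ALT[0-9]+$", cols, value = TRUE)

print(loose)
print(strict)
```

Both patterns skip the bare ALT column, because at least one digit is required. Only the anchored one also skips a name such as oldALT3x.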
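Next is the code that handles all of the ALT columns in one pass, however many there are. It is more verbose than strictly necessary, but it helps show how things break down into components.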
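ret <- t(sapply(1:nrow(result), function(r) {
  dat <- result[r,]    # one row at a time
  # ":start:end" (the empty first element produces the leading colon)
  part1 <- paste(c('', dat[,c('start','end')]), collapse=':')
  # "REF(count)": the REF letter is used as a column name
  part2 <- sprintf('%s(%s)', dat$REF, dat[ dat$REF ])
  # one "ALT(count)" string per ALT column
  part3 <- sapply(alts, function(alt) sprintf('%s(%s)', dat[[alt]], dat[ dat[[alt]] ]) )
  part23 <- paste(part2, part3, sep=',')
  paste(part1, part23, sep='-')
}))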
colnames(ret) <- alts
ret
##      ALT1                                    ALT2
## [1,] ":chr1:101544447:101544447-A(NA),T(53)" ":chr1:101544447:101544447-A(NA),T(53)"
## [2,] ":chr1:102053031:102053031-C(34),G(NA)" ":chr1:102053031:102053031-C(34),T(NA)"
## [3,] ":chr1:102778767:102778767-C(24),T(NA)" ":chr1:102778767:102778767-C(24),C(24)"
## [4,] ":chr1:102789831:102789831-C(NA),T(30)" ":chr1:102789831:102789831-C(NA),T(30)"
## [5,] ":chr1:102989480:102989480-C(NA),T(12)" ":chr1:102989480:102989480-C(NA),G(NA)"
## [6,] ":chr1:103310574:103310574-C(22),A(2)"  ":chr1:103310574:103310574-C(22),G(NA)"
## [7,] ":chr1:103870326:103870326-C(12),A(NA)" ":chr1:103870326:103870326-C(12),G(NA)"
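Explanation:
t(sapply(...)) is used to iterate over each row of the data.frame. The t(...) is necessary because otherwise the result would be rotated relative to what you expect.
part1 and part2 each create a single string for the row. Since every ALT string built for a row shares these components, they only need to be created once.
part3 creates one string per ALT column.
part23 simply combines part2 with each of the strings in part3.
It may be informative to set r <- 1 on the console and step through this by hand, inspecting each variable as it is created.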
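To make the stepping-through suggestion concrete, here is a small sketch on a made-up three-row data frame (`toy` and its values are hypothetical, not the data from the question). It shows the name-based lookup behind part2 and part3, and why the sapply result needs t().

```r
# Made-up data following the same layout idea as `result` (hypothetical values).
toy <- data.frame(
  A = c(NA, 2L, NA), C = c(34L, 22L, 12L), T = c(53L, NA, NA),
  REF = c("C", "C", "C"), ALT1 = c("T", "A", "A"), ALT2 = c("A", "C", "T"),
  stringsAsFactors = FALSE
)

r <- 1
dat <- toy[r, ]

# The letter stored in REF is used as a column name to fetch the count.
ref_count <- dat[[dat$REF]]                     # value of column "C" in row 1
part2 <- sprintf("%s(%s)", dat$REF, ref_count)
print(part2)

# sapply() binds each iteration's result as a COLUMN, so data rows end up as
# matrix columns; t() restores one data row per matrix row.
m <- sapply(seq_len(nrow(toy)), function(i) c(ALT1 = toy$ALT1[i], ALT2 = toy$ALT2[i]))
print(dim(m))      # 2 x 3: one matrix row per ALT column
print(dim(t(m)))   # 3 x 2: one matrix row per data row
```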
Finally, you said you wanted (for a reason that escapes me) to combine the per-ALT strings into a single string. You can do that with:
apply(ret, 1, paste, collapse='')
## [1] ":chr1:101544447:101544447-A(NA),T(53):chr1:101544447:101544447-A(NA),T(53)"
## [2] ":chr1:102053031:102053031-C(34),G(NA):chr1:102053031:102053031-C(34),T(NA)"
## [3] ":chr1:102778767:102778767-C(24),T(NA):chr1:102778767:102778767-C(24),C(24)"
## [4] ":chr1:102789831:102789831-C(NA),T(30):chr1:102789831:102789831-C(NA),T(30)"
## [5] ":chr1:102989480:102989480-C(NA),T(12):chr1:102989480:102989480-C(NA),G(NA)"
## [6] ":chr1:103310574:103310574-C(22),A(2):chr1:103310574:103310574-C(22),G(NA)"
## [7] ":chr1:103870326:103870326-C(12),A(NA):chr1:103870326:103870326-C(12),G(NA)"
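The row-wise collapse can be checked on a tiny matrix. This sketch uses a made-up character matrix `m` (hypothetical values). It also shows do.call(paste0, ...) as one possible column-wise alternative to apply(), offered as an illustration rather than a required change.

```r
# Made-up 2 x 2 character matrix standing in for `ret`.
m <- matrix(c("a1", "b1", "a2", "b2"), nrow = 2,
            dimnames = list(NULL, c("ALT1", "ALT2")))

# Paste the cells of each row together, with no separator.
by_row <- apply(m, 1, paste, collapse = "")

# Same result, pasting whole columns together element by element.
by_col <- do.call(paste0, as.data.frame(m, stringsAsFactors = FALSE))

# A separator can make the boundary between ALT entries visible.
with_sep <- apply(m, 1, paste, collapse = ";")

print(by_row)
print(with_sep)
```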
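As an aside: I'm not sure why you put a leading colon before the first chr1:.... If it is in anticipation of the final merge, that can be handled by changing one line in the sapply code to: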
part1 <- paste(dat[,c('start','end')], collapse=':')
and the last line to:
apply(ret, 1, paste, collapse=':')
But perhaps you have reasons that are not clear to me.
Cheers!
Edit: wrapping this in a function should be trivial:
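func <- function(result) {
  alts <- colnames(result)[grepl('ALT[0-9]+', colnames(result))]
  ret <- t(sapply(1:nrow(result), function(r) {
    dat <- result[r,]
    part1 <- paste(c('', dat[,c('start','end')]), collapse=':')
    part2 <- sprintf('%s(%s)', dat$REF, dat[ dat$REF ])
    part3 <- sapply(alts, function(alt) sprintf('%s(%s)', dat[[alt]], dat[ dat[[alt]] ]) )
    part23 <- paste(part2, part3, sep=',')
    paste(part1, part23, sep='-')
  }))
  colnames(ret) <- alts
  apply(ret, 1, paste, collapse='')
}
func(result)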
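As an alternative to looping over rows, the lookup can also be vectorized with matrix indexing. The sketch below is not the approach used in this answer. `lookup_count` and `annotate_alts` are hypothetical names, the `bases` default is an assumption about which count columns exist, and `toy` is made-up data. It omits the leading colon and returns one column per ALT column.

```r
# Hypothetical helper: for each row, fetch the count from the column whose
# name equals that row's letter (NA if the letter is NA or not in `bases`).
lookup_count <- function(df, letters_vec, bases = c("A", "C", "G", "T", "N")) {
  counts <- as.matrix(df[, bases, drop = FALSE])
  col_idx <- match(letters_vec, bases)
  counts[cbind(seq_len(nrow(df)), col_idx)]   # one (row, column) pair per row
}

# Hypothetical wrapper: builds "start:end-REF(n),ALT(n)" for every ALT column.
annotate_alts <- function(df, bases = c("A", "C", "G", "T", "N")) {
  alts <- grep("^ALT[0-9]+$", colnames(df), value = TRUE)
  prefix <- paste(df$start, df$end, sep = ":")
  ref <- sprintf("%s(%s)", df$REF, lookup_count(df, df$REF, bases))
  vapply(alts, function(alt) {
    alt_part <- sprintf("%s(%s)", df[[alt]], lookup_count(df, df[[alt]], bases))
    paste0(prefix, "-", ref, ",", alt_part)
  }, character(nrow(df)))
}

# Made-up two-row example (values are illustrative only).
toy <- data.frame(
  start = c("chr1:100", "chr1:200"), end = c(100L, 200L),
  A = c(NA, 2L), C = c(34L, 22L), G = c(NA_integer_, NA_integer_),
  T = c(53L, NA), N = c(NA_integer_, NA_integer_),
  REF = c("C", "C"), ALT1 = c("T", "A"), ALT2 = c("A", "G"),
  stringsAsFactors = FALSE
)
out <- annotate_alts(toy)
print(out)
```

The result is a character matrix, so apply(out, 1, paste, collapse='') can still be used afterwards to merge the per-ALT strings of each row.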
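Edit #2: as the list of requirements grows, I feel like I'm in a spiral-development government contract ;-)
Going back to the unmodified data (with the original ALT2), I changed one of the ALT2 values in order to test that this code does what I infer the intent to be: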
result$ALT2[5] <- 'A'
... and now the modified code, all in one:
ret <- sapply(1:nrow(result), function(r) {
  dat <- result[r,]
  part1 <- paste(c('', dat[,c('start','end')]), collapse=':')
  part2 <- sprintf('%s(%s)', dat$REF, dat[ dat$REF ])
  # skip ALT columns that are NA in this row; unlist() drops the NULLs
  part3 <- unlist(sapply(alts, function(alt) {
    if (is.na(dat[[alt]])) NULL
    else sprintf('%s(%s)', dat[[alt]], dat[ dat[[alt]] ])
  }))
  part23 <- paste(part2, part3, sep=',')
  # collapse this row's per-ALT strings into a single string
  paste(part1, part23, sep='-', collapse='')
})
ret
## [1] ":chr1:101544447:101544447-A(NA),T(53)"
## [2] ":chr1:102053031:102053031-C(34),G(NA)"
## [3] ":chr1:102778767:102778767-C(24),T(NA)"
## [4] ":chr1:102789831:102789831-C(NA),T(30)"
## [5] ":chr1:102989480:102989480-C(NA),T(12):chr1:102989480:102989480-C(NA),A(NA)"
## [6] ":chr1:103310574:103310574-C(22),A(2)"
## [7] ":chr1:103870326:103870326-C(12),A(NA)"
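The NA-skipping works because unlist() drops NULL entries. Here is a minimal sketch on a made-up vector. The letters and the count 30 are hypothetical, and lapply() is used in place of sapply() only to keep the intermediate result an explicit list.

```r
# One row's ALT letters; ALT2 is missing for this row (made-up values).
alts_in_row <- c(ALT1 = "T", ALT2 = NA)

# Return NULL for missing letters; unlist() then drops those entries.
parts <- unlist(lapply(alts_in_row, function(a) {
  if (is.na(a)) NULL else sprintf("%s(%s)", a, 30L)
}))

print(parts)          # only the ALT1 entry remains
print(length(parts))
```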