How to split a comma-separated column into new columns in R

Date: 2014-03-30 18:08:09

Tags: r split comma

I have this data:

CHOM POS REF ALT
1    121  A   AA,AT
2    254  GCGC  GCGCG,AGCG
3    214  C    T

I need to split the ALT column into:

CHOM POS REF       ALT        ALT1    ALT2 ...
1    121  A        AA         AT        0
2    254  GCGC    GCGCG      AGCG       0
3    214   C        T         0         0

I tried the following, but it gives an error:

alt=x$ALT
strsplit(alt, ",")

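Note: there are many different ALT and REF values; going by the commas, a row has at most 4 values. Where a value is missing, just put 0 or NA.

3 answers:

Answer 0 (score: 4)

New answer

I would write a function like the following to split the column: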

splitFun <- function(inVec, sep = ",", newName = "ALT", fill = NA) {
  if (!is.character(inVec)) inVec <- as.character(inVec)
  X <- strsplit(inVec, sep, fixed = TRUE)
  cols <- vapply(X, length, 1L)
  M <- matrix(
    fill, nrow = length(inVec), ncol = max(cols),
    dimnames = list(NULL, make.unique(rep(newName, max(cols)), sep="")))
  M[cbind(rep(sequence(length(X)), cols), sequence(cols))] <- 
    unlist(X, use.names=FALSE)
  M
}
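
The as.character() conversion on the first line of the function matters because strsplit() only accepts character vectors. A likely cause of the error in the question (an assumption, since the question does not show the message) is that x$ALT was a factor, which was the default for strings in data frames in R versions before 4.0. A minimal sketch, using a hypothetical vector alt in place of the asker's column:

```r
# Hypothetical reproduction: ALT stored as a factor, as older R versions did by default
alt <- factor(c("AA,AT", "GCGCG,AGCG", "T"))

# strsplit() refuses a factor; capture the error message instead of stopping
msg <- tryCatch(strsplit(alt, ","), error = function(e) conditionMessage(e))
msg

# Converting to character first makes the same call work
parts <- strsplit(as.character(alt), ",", fixed = TRUE)
lengths(parts)  # number of pieces in each row
```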

Usage is simple:

splitFun(mydf$ALT)  ## Modify default arguments accordingly
#      ALT     ALT1   ALT2
# [1,] "AA"    "AT"   NA  
# [2,] "GCGCG" "AGCG" NA  
# [3,] "GCGCG" "AT"   "AA"
cbind(mydf, splitFun(mydf$ALT))
#   CHOM POS  REF         ALT   ALT ALT1 ALT2
# 1    1 121    A       AA,AT    AA   AT <NA>
# 2    2 254 GCGC  GCGCG,AGCG GCGCG AGCG <NA>
# 3    1 123 GCGC GCGCG,AT,AA GCGCG   AT   AA
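
The key step inside splitFun is indexing the matrix with a two-column matrix of (row, column) positions, so every piece is placed in a single vectorised assignment. Here is a rough sketch of just that step on the ALT values from the question, using "0" as the placeholder the asker mentioned; the names X, cols, M and idx are illustrative only:

```r
# ALT values from the question, already split on commas
X <- strsplit(c("AA,AT", "GCGCG,AGCG", "T"), ",", fixed = TRUE)
cols <- lengths(X)  # number of pieces per row: 2, 2, 1

# Start from a matrix filled with the placeholder value
M <- matrix("0", nrow = length(X), ncol = max(cols))

# Two-column index: column 1 is the target row, column 2 the target column
idx <- cbind(rep(seq_along(X), cols), sequence(cols))

# One assignment puts every piece into its (row, column) cell
M[idx] <- unlist(X, use.names = FALSE)
M
```

Cells that receive no piece keep the placeholder, so the third row ends up as "T" followed by "0". Because the pieces are strings, the placeholder is stored as the string "0" rather than the number 0.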

Performance should be quite good. Here is a timing comparison with the "splitstackshape" approach (which can also handle unbalanced cases).

system.time(splitstackshape:::read.concat(
  bigDf$ALT, sep=",", col.prefix="ALT"))
#    user  system elapsed 
#   1.197   0.000   1.202 
system.time(splitFun(bigDf$ALT))
#    user  system elapsed 
#   0.069   0.000   0.068 

The sample data used above is:

mydf <- data.frame(CHOM = c(1, 2, 1), POS = c(121, 254, 123), 
                   REF = c("A", "GCGC", "GCGC"), 
                   ALT = c("AA,AT", "GCGCG,AGCG", "GCGCG,AT,AA"))
mydf
#   CHOM POS  REF         ALT
# 1    1 121    A       AA,AT
# 2    2 254 GCGC  GCGCG,AGCG
# 3    1 123 GCGC GCGCG,AT,AA

bigDf <- do.call(rbind, replicate(10000, mydf, simplify = FALSE))

Old answer

You can try concat.split from my "splitstackshape" package:

library(splitstackshape)
concat.split(mydf, "ALT", ",")  ## Add `drop = TRUE` to drop the original column
#   CHOM POS  REF        ALT ALT_1 ALT_2
# 1    1 121    A      AA,AT    AA    AT
# 2    2 254 GCGC GCGCG,AGCG GCGCG  AGCG

There is also colsplit in the "reshape2" package:

library(reshape2)
colsplit(as.character(mydf$ALT), ",", c("ALT", "ALT1"))
#     ALT ALT1
# 1    AA   AT
# 2 GCGCG AGCG

You can use cbind to add the output to the original dataset.

Answer 1 (score: 2)

Assuming your data is in dat (and every row has the same number of comma-separated values):

> dat2 <- data.frame(dat[, -4], t(sapply(strsplit(as.character(dat$ALT), ","), cbind)))
> colnames(dat2)[4:5] <- c("ALT", "ALT1")
> dat2
  CHOM POS  REF   ALT ALT1
1    1 121    A    AA   AT
2    2 254 GCGC GCGCG AGCG

Answer 2 (score: 1)

> dat[ c("ALT", "ALT1")] <- read.table(text=as.character(dat$ALT), sep=",")
> dat
  CHOM POS  REF   ALT ALT1
1    1 121    A    AA   AT
2    2 254 GCGC GCGCG AGCG
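
The read.table call above expects every row to hold the same number of values. For unbalanced rows like the third one in the question, one possible variation (a sketch, not part of the original answer) is to pass fill = TRUE together with explicit column names. The helper names n_alt, alt_names and parts are made up for illustration, and colClasses = "character" is there so that a lone "T" is not read as the logical TRUE:

```r
# Data from the question, with ALT kept as character
dat <- data.frame(CHOM = c(1, 2, 3), POS = c(121, 254, 214),
                  REF = c("A", "GCGC", "C"),
                  ALT = c("AA,AT", "GCGCG,AGCG", "T"),
                  stringsAsFactors = FALSE)

# Widest row decides how many columns read.table should create
n_alt <- max(lengths(strsplit(dat$ALT, ",", fixed = TRUE)))
alt_names <- make.unique(rep("ALT", n_alt), sep = "")  # "ALT", "ALT1", ...

# fill = TRUE pads short rows; na.strings = "" turns the padding into NA
parts <- read.table(text = dat$ALT, sep = ",", fill = TRUE,
                    col.names = alt_names, colClasses = "character",
                    na.strings = "")

# Replace the original ALT column with the split columns
result <- cbind(dat[setdiff(names(dat), "ALT")], parts)
result
```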