Applying a function to an R data.table that updates multiple columns and creates multiple rows

Asked: 2014-12-18 18:47:44

Tags: r data.table

My starting table looks like this:

   CHROM     POS                       REF                                 ALT  GT
1:     1   58211                         A                                   G 1/1
2:     1 6464767 CAAATAAATAAATAAATAAATAAAT C,CAAATAAATAAATAAATAAATAAATAAATAAAT 1/2
3:    12   83011                         T                                   C 0/1
4:    18 1541042                         C                                 T,A 1/2

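I want to apply a function, "ap2", that splits the long REF and ALT entries in row 2 into two shorter entries, updates row 2 (changing REF, ALT and GT), and inserts a new row (#3, with a new POS, ALT and GT). The result should look like this: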

   CHROM     POS                       REF                                 ALT  GT
1:     1   58211                         A                                   G 1/1
2:     1 6464767 CAAATAAATAAATAAATAAATAAAT                                   C 1/2
3:     1 6464791                         T                           TAAATAAAT 1/2
4:    12   83011                         T                                   C 0/1
5:    18 1541042                         C                                 T,A 1/2

If I run the ap2 function in a grouped call, it returns the expected result (columns V1-V4):

tmp[,ap2(POS,REF,ALT,GT), by=c("CHROM","POS","REF","ALT","GT")]
   CHROM     POS                       REF                                 ALT  GT      V1                        V2        V3  V4
1:     1 6464767 CAAATAAATAAATAAATAAATAAAT C,CAAATAAATAAATAAATAAATAAATAAATAAAT 1/2 6464767 CAAATAAATAAATAAATAAATAAAT         C 0/1
2:     1 6464767 CAAATAAATAAATAAATAAATAAAT C,CAAATAAATAAATAAATAAATAAATAAATAAAT 1/2 6464791                         T TAAATAAAT 0/1
3:    18 1541042                         C                                 T,A 1/2 1541042                         C       T,A 1/2
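A side note on the output above: it has three rows rather than five, most likely because ap2 returns NULL for every genotype other than "1/2", and data.table omits groups whose j expression evaluates to NULL. The following is a minimal sketch of that behaviour on a toy table; `toy`, `g`, `x` and `y` are made-up names, not part of the original data.

```r
library(data.table)

# Toy table (hypothetical, not the poster's data): two single-row groups.
toy <- data.table(g = c("a", "b"), x = c(1L, 2L))

# An `if` without `else` evaluates to NULL when the condition is FALSE,
# so group "a" contributes nothing and is absent from the result.
res <- toy[, if (x > 1L) list(y = x * 10L), by = g]
print(res)
```

Only group "b" survives here, which mirrors how the rows with GT "1/1" and "0/1" vanish from the grouped ap2 call.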

However, if I try to update the original columns in place, I get warnings and the extra row is discarded:

tmp[, c("POS","REF","ALT","GT") := ap2(POS,REF,ALT,GT), by=c("CHROM","POS","REF","ALT","GT")]
Warning messages:
1: In `[.data.table`(tmp, , `:=`(c("POS", "REF", "ALT", "GT"), ap2(POS,  :
  RHS 1 is length 2 (greater than the size (1) of group 2). The last 1 element(s) will be discarded.
2: In `[.data.table`(tmp, , `:=`(c("POS", "REF", "ALT", "GT"), ap2(POS,  :
  RHS 2 is length 2 (greater than the size (1) of group 2). The last 1 element(s) will be discarded.
3: In `[.data.table`(tmp, , `:=`(c("POS", "REF", "ALT", "GT"), ap2(POS,  :
  RHS 3 is length 2 (greater than the size (1) of group 2). The last 1 element(s) will be discarded.
4: In `[.data.table`(tmp, , `:=`(c("POS", "REF", "ALT", "GT"), ap2(POS,  :
  RHS 4 is length 2 (greater than the size (1) of group 2). The last 1 element(s) will be discarded.
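A plausible reading of these warnings: `:=` assigns into the rows that already exist, so a group of size 1 cannot receive two values, and the table cannot grow this way. A grouped j expression without `:=` has no such restriction, because it builds a new table. The sketch below illustrates the difference on a toy table; `toy`, `id`, `n`, `k` and `expanded` are hypothetical names chosen for illustration.

```r
library(data.table)

# Toy table (hypothetical): one row per id, n = number of rows wanted for that id.
toy <- data.table(id = c("a", "b"), n = c(1L, 2L))

# Without `:=`, each group may return any number of rows;
# the result is a new table and `toy` itself is left untouched.
expanded <- toy[, list(k = seq_len(n)), by = id]

print(expanded)   # one row for "a", two rows for "b"
print(nrow(toy))  # still 2
```

Under this reading, adding a row requires building a new table from the grouped result rather than updating the old one by reference.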

Here is the code that creates my tmp data.table:

library(data.table)

tmp <- data.table(
  CHROM = as.character(c("1","1","12","18")) ,
  POS = as.integer(c(58211,6464767,83011,1541042)) ,
  REF = c("A","CAAATAAATAAATAAATAAATAAAT","T","C") ,
  ALT = c("G","C,CAAATAAATAAATAAATAAATAAATAAATAAAT","C","T,A") ,
  GT = c("1/1","1/2","0/1","1/2")
)

This is the function I am trying to apply:

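# Split a multi-allelic "1/2" record into one entry per ALT allele.
# Note: for any other genotype the function falls through and returns NULL.
ap2 <- function(pos,ref,alt,gt) {
  if(gt=="1/2") {
    alt.split <- unlist(strsplit(alt,","))
    # match.length is -1 where ref (used as a regular expression) is not found
    # in an ALT allele; otherwise it is the length of the match
    matching <- attr(regexpr(ref,alt.split), "match.length")
    if(max(matching) == -1) {
      # ref occurs in none of the ALT alleles: return the record unchanged
      list(pos,ref,alt,gt)
    } else {
      alt.new <- NULL
      ref.new <- NULL
      pos.new <- NULL
      gt.new <- NULL
      for(i in seq_along(matching)) {
        stopPos <- matching[i]
        if(stopPos == -1) {
          # no match: keep the original position and REF
          pos.new <- c(pos.new,as.integer(pos))
          ref.new <- c(ref.new,ref)
          alt.new <- c(alt.new,alt.split[i])
        } else {
          # match: keep only the text from the last matched character onward
          pos.new <- c(pos.new, as.integer(pos+matching[i]-1))
          ref.new <- c(ref.new, substring(ref,stopPos))
          alt.new <- c(alt.new, substring(alt.split[i],stopPos))
        }
        gt.new <- c(gt.new, "0/1")
      }
      list(pos.new, ref.new, alt.new, gt.new)
    }
  }
}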
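One possible way to get the five-row table is to skip `:=` and build a new table from a row-wise grouped call. The sketch below is an assumption-laden illustration, not a tested answer from the thread: `split_alleles`, `work`, `out` and the helper column `row` are hypothetical names. It also differs from ap2 in two ways: it treats REF as a literal string (`fixed = TRUE`), and it returns the record unchanged for genotypes other than "1/2". It sets GT to "0/1" for split entries, as ap2 does.

```r
library(data.table)

tmp <- data.table(
  CHROM = c("1", "1", "12", "18"),
  POS   = c(58211L, 6464767L, 83011L, 1541042L),
  REF   = c("A", "CAAATAAATAAATAAATAAATAAAT", "T", "C"),
  ALT   = c("G", "C,CAAATAAATAAATAAATAAATAAATAAATAAAT", "C", "T,A"),
  GT    = c("1/1", "1/2", "0/1", "1/2")
)

# Hypothetical helper: same splitting rule as ap2, but it always returns a
# named list, so no group is dropped and the result keeps the column names.
split_alleles <- function(pos, ref, alt, gt) {
  unchanged <- list(POS = pos, REF = ref, ALT = alt, GT = gt)
  if (gt != "1/2") return(unchanged)
  alt.split <- unlist(strsplit(alt, ",", fixed = TRUE))
  # fixed = TRUE matches ref literally instead of as a regular expression
  len <- attr(regexpr(ref, alt.split, fixed = TRUE), "match.length")
  if (max(len) == -1) return(unchanged)
  hit <- len != -1  # TRUE where ref occurs inside the ALT allele
  list(
    POS = ifelse(hit, pos + len - 1L, pos),
    REF = ifelse(hit, substring(ref, len), ref),
    ALT = ifelse(hit, substring(alt.split, len), alt.split),
    GT  = rep("0/1", length(alt.split))
  )
}

# Work on a copy with an explicit row id so that every row is its own group.
work <- copy(tmp)[, row := .I]
out  <- work[, split_alleles(POS, REF, ALT, GT), by = .(row, CHROM)]
out[, row := NULL]  # drop the helper column
print(out)
```

Grouping by a row id instead of by POS, REF, ALT and GT avoids duplicated column names in the result, and duplicate input records would still be processed one by one.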

0 answers:

No answers yet.