dcast aggregates by length

Asked: 2016-02-11 17:45:19

Tags: r reshape2

I am trying to use dcast to convert nucleotide frequencies from long format to wide format, as shown below:

library(reshape2)  # provides dcast()

res <- read.table(text='seqnames    pos strand  nucleotide  count   which_label V3  REF
1   134199222   -   A   NA  1:134199222-134199222   ENSMUST00000086465  T
1   134199222   -   A   NA  1:134199222-134199222   ENSMUST00000169927  T
1   134199222   -   A   NA  1:134199222-134199222   ENSMUST00000038191  T
1   134199222   -   A   NA  1:134199222-134199222   ENSMUST00000086465  T
1   134199222   -   A   NA  1:134199222-134199222   ENSMUST00000169927  T
1   134199222   -   A   NA  1:134199222-134199222   ENSMUST00000038191  T', header=TRUE)

> res
seqnames       pos strand nucleotide count           which_label                 V3  REF
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000086465 TRUE
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000169927 TRUE
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000038191 TRUE
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000086465 TRUE
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000169927 TRUE
       1 134199222      -          A    NA 1:134199222-134199222 ENSMUST00000038191 TRUE

# change the levels so that even if there is no information, we get an output
res$strand <- factor(res$strand,levels=c('-','+'))
res$nucleotide <- factor(res$nucleotide,levels=c('A','T','G','C'))
res$seqnames <- factor(res$seqnames, levels=unique(res$seqnames))

# convert NAs to 0
# do not drop any missing rows
# get results for all possible nucleotide and strand even if absent
results <- dcast(res, seqnames+pos+V3~nucleotide+strand,
                 value.var = "count", fill = 0, drop=FALSE)

*Aggregation function missing: defaulting to length*

# results object looks like this

seqnames       pos                 V3 A_- A_+ T_- T_+ G_- G_+ C_- C_+
       1 134199222 ENSMUST00000038191   2   0   0   0   0   0   0   0
       1 134199222 ENSMUST00000086465   2   0   0   0   0   0   0   0
       1 134199222 ENSMUST00000169927   2   0   0   0   0   0   0   0
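
The 2 in A_- looks like a row count rather than anything derived from count: each V3 value occurs twice in res with the same seqnames, pos, nucleotide and strand. A minimal base-R sketch to check this on a reduced copy of the data (res_demo and rows_per_cell are illustrative names, not part of the original code):

```r
# Reduced copy of the six rows above (only the columns used in the cast)
res_demo <- data.frame(
  seqnames   = 1,
  pos        = 134199222,
  strand     = "-",
  nucleotide = "A",
  count      = NA_real_,
  V3 = rep(c("ENSMUST00000086465", "ENSMUST00000169927", "ENSMUST00000038191"), 2)
)

# Count how many input rows fall into each (row id, column id) combination.
# A helper column n = 1 is summed per group; count itself is not involved.
rows_per_cell <- aggregate(
  n ~ seqnames + pos + V3 + nucleotide + strand,
  data = transform(res_demo, n = 1L),
  FUN  = sum
)
print(rows_per_cell)
```

If n is greater than 1 for any group, several rows compete for the same cell of the wide table, so some aggregation has to happen.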

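As you can see, dcast counts rows by default and outputs 2 in A_-, whereas I want 0 because the count values in the data frame are NA. I was expecting something like this: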

seqnames       pos                 V3 A_- A_+ T_- T_+ G_- G_+ C_- C_+
       1 134199222 ENSMUST00000038191   0   0   0   0   0   0   0   0
       1 134199222 ENSMUST00000086465   0   0   0   0   0   0   0   0
       1 134199222 ENSMUST00000169927   0   0   0   0   0   0   0   0

Even though I use value.var = "count", why does it still aggregate by length? Any help would be greatly appreciated!

Thanks!
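
A possible direction, offered as a sketch rather than a confirmed fix: value.var only selects which column supplies the values, while dcast falls back to length whenever no fun.aggregate is given and more than one row maps to the same cell. Passing an explicit aggregation function that sums while ignoring NA may give the expected table. This assumes that duplicate rows should be summed and that NA should count as 0; res_demo and sum_counts are illustrative names, not part of the original code.

```r
library(reshape2)

# Reduced copy of the data from the question (only the columns used in the cast)
res_demo <- data.frame(
  seqnames   = 1,
  pos        = 134199222,
  strand     = "-",
  nucleotide = "A",
  count      = NA_real_,
  V3 = rep(c("ENSMUST00000086465", "ENSMUST00000169927", "ENSMUST00000038191"), 2)
)

# Same factor levels as in the question, so absent combinations still get a column
res_demo$strand     <- factor(res_demo$strand, levels = c("-", "+"))
res_demo$nucleotide <- factor(res_demo$nucleotide, levels = c("A", "T", "G", "C"))
res_demo$seqnames   <- factor(res_demo$seqnames, levels = unique(res_demo$seqnames))

# Assumption: duplicates are summed and NA contributes nothing
sum_counts <- function(x) sum(x, na.rm = TRUE)

results_demo <- dcast(res_demo, seqnames + pos + V3 ~ nucleotide + strand,
                      value.var = "count", fun.aggregate = sum_counts,
                      fill = 0, drop = FALSE)
print(results_demo)
```

With an explicit fun.aggregate, the "defaulting to length" message should no longer appear, and cells that hold only NA values sum to 0. If the duplicate rows are not meant to be combined, removing them first (for example with unique(res)) would be a different choice with different results.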

0 answers:

No answers