I have a data frame of the following form, produced by a dplyr summarise call.
pos nuc sample total
23 A 10028_1#2 3
23 C 10028_1#2 1
23 G 10028_1#2 5129
23 T 10028_1#2 128
231 C 10028_1#2 4
231 T 10028_1#2 3123
.
.
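Plotting this data as a bar chart with ggplot2 gives an uneven plot, because pos 231 has no total values for A and G for the corresponding sample name. The values are absent because the data were generated by a program outside R.
What is the idiomatic way to insert a total of 0 for every missing A, T, G, C at each position, for each corresponding sample? In other words, how do I get this data frame?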
pos nuc sample total
23 A 10028_1#2 3
23 C 10028_1#2 1
23 G 10028_1#2 5129
23 T 10028_1#2 128
231 C 10028_1#2 4
231 T 10028_1#2 3123
231 G 10028_1#2 0
231 A 10028_1#2 0
Answer 0 (score: 2)
We can use complete from tidyr:
library(dplyr)
library(tidyr)
df1 %>%
complete(pos, nuc, nesting(sample), fill = list(total = 0))
# pos nuc sample total
# <int> <chr> <chr> <dbl>
#1 23 A 10028_1#2 3
#2 23 C 10028_1#2 1
#3 23 G 10028_1#2 5129
#4 23 T 10028_1#2 128
#5 231 A 10028_1#2 0
#6 231 C 10028_1#2 4
#7 231 G 10028_1#2 0
#8 231 T 10028_1#2 3123
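Note that complete only crosses the values that actually occur in each column, so a nucleotide that never appears anywhere in the data would not be added. A minimal sketch of one way around that, assuming the full alphabet is known in advance: pass the values for nuc explicitly. The data frame df2 below is a hypothetical input, not part of the question.

```r
library(tidyr)

# Hypothetical input in which "A" and "G" never occur at any position
df2 <- data.frame(
  pos    = c(231L, 231L),
  nuc    = c("C", "T"),
  sample = c("10028_1#2", "10028_1#2"),
  total  = c(4L, 3123L),
  stringsAsFactors = FALSE
)

# Supply the full set of nucleotides instead of relying on observed values;
# 0L keeps the total column integer
res <- complete(df2, pos, nuc = c("A", "C", "G", "T"), sample,
                fill = list(total = 0L))
```

With the data in the question this is not needed, since all four letters occur at pos 23.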
Or we can use expand.grid/merge from base R:
transform(merge(expand.grid(lapply(df1[1:3], unique)),
df1, all.x=TRUE), total = replace(total, is.na(total), 0))
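The one-liner above packs three steps together. Here is a sketch of the same idea written out step by step, with the example data inlined so the block runs on its own. The intermediate names grid and merged are only illustrative, and stringsAsFactors = FALSE is an added assumption to keep the key columns as plain character vectors.

```r
df1 <- data.frame(
  pos    = c(23L, 23L, 23L, 23L, 231L, 231L),
  nuc    = c("A", "C", "G", "T", "C", "T"),
  sample = rep("10028_1#2", 6),
  total  = c(3L, 1L, 5129L, 128L, 4L, 3123L),
  stringsAsFactors = FALSE
)

# 1. Build every combination of the observed pos, nuc and sample values
grid <- expand.grid(lapply(df1[1:3], unique), stringsAsFactors = FALSE)

# 2. Left join onto the grid; combinations absent from df1 get total = NA
merged <- merge(grid, df1, all.x = TRUE)

# 3. Replace the NA totals with 0
merged$total[is.na(merged$total)] <- 0L
```

merge joins on the columns the two frames have in common (pos, nuc, sample), so no by argument is needed here.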
Data:
df1 <- structure(list(pos = c(23L, 23L, 23L, 23L, 231L, 231L),
    nuc = c("A", "C", "G", "T", "C", "T"),
    sample = c("10028_1#2", "10028_1#2", "10028_1#2",
               "10028_1#2", "10028_1#2", "10028_1#2"),
    total = c(3L, 1L, 5129L, 128L, 4L, 3123L)),
    .Names = c("pos", "nuc", "sample", "total"),
    class = "data.frame", row.names = c(NA, -6L))