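I am trying to reshape my data with dcast. I am working with samples, each of which has 10-30 sample units. I do not want my data to be aggregated.
My data is in the following format: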
ID total
sample_1 1
sample_1 0
sample_1 2
sample_1 1
sample_1 0
sample_1 0
sample_1 2
sample_1 1
sample_1 0
sample_1 2
sample_1 1
sample_1 4
sample_2 2
sample_2 1
sample_2 2
sample_2 0
sample_2 0
sample_2 0
sample_2 1
sample_2 2
sample_2 1
sample_2 4
sample_2 5
sample_2 2
sample_2 1
sample_3 0
sample_3 0
sample_3 1
sample_3 2
sample_3 1
sample_3 0
sample_3 2
sample_3 1
sample_3 4
sample_3 5
sample_3 1
sample_3 1
sample_3 0
sample_3 0
sample_3 1
I would like it to look like this:
sample_1 sample_2 sample_3
1        2        0
0        1        0
2        2        1
1        0        2
0        0        1
0        0        0
2        1        2
1        2        1
0        1        4
2        4        5
1        5        1
4        2        1
         1        0
                  0
                  1
That is, each sample ID becomes its own column.
I have tried several approaches, but R keeps aggregating the data.
Answer 0 (score: 1)
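You can do this with dcast(), but you have to add a row number within each ID. Without one, several values map to the same ID and dcast() falls back to aggregating them with length. The data.table package is another package, besides reshape2, that implements dcast(). data.table has a handy rowid() function that generates a unique row ID within each group. With this, we get: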
library(data.table)
dcast(setDT(DF), rowid(ID) ~ ID, value.var = "total")
# ID sample_1 sample_2 sample_3
# 1: 1 1 2 0
# 2: 2 0 1 0
# 3: 3 2 2 1
# 4: 4 1 0 2
# 5: 5 0 0 1
# 6: 6 0 0 0
# 7: 7 2 1 2
# 8: 8 1 2 1
# 9: 9 0 1 4
#10: 10 2 4 5
#11: 11 1 5 1
#12: 12 4 2 1
#13: 13 NA 1 0
#14: 14 NA NA 0
#15: 15 NA NA 1
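If data.table is not available, the same idea (number the rows within each ID, then spread the IDs into columns) can be sketched in base R. This is only an illustration under assumptions: the data frame toy and its values are made up for the example and are not the question's data, and base R's reshape() is swapped in for dcast().

```r
# Small stand-in data set (hypothetical values, not the question's data)
toy <- data.frame(
  ID = c("a", "a", "a", "b", "b"),
  total = c(1L, 0L, 2L, 5L, 3L)
)

# Number the rows within each ID: 1, 2, 3 for "a" and 1, 2 for "b"
toy$row <- ave(seq_along(toy$ID), toy$ID, FUN = seq_along)

# Spread to wide format: one column per ID, NA where a group is shorter
wide <- reshape(toy, idvar = "row", timevar = "ID", direction = "wide")
wide
```

The result has one row per within-group row number and the columns row, total.a and total.b. Because "b" has only two values, its third entry is NA, just like the NA padding in the dcast() output above.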
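However, I suggest keeping the data in long format for any further processing and using grouping. This is much easier than working with individual columns. For example,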
# count observations by group
DF[, .N, by = ID]
# ID N
#1: sample_1 12
#2: sample_2 13
#3: sample_3 15
# compute mean by group
DF[, mean(total), by = ID]
# ID V1
#1: sample_1 1.166667
#2: sample_2 1.615385
#3: sample_3 1.266667
# get min and max by group
DF[, .(min = min(total), max = max(total)), by = ID]
# ID min max
#1: sample_1 0 4
#2: sample_2 0 5
#3: sample_3 0 5
# the same using range()
DF[, as.list(range(total)), by = ID]
# ID V1 V2
#1: sample_1 0 4
#2: sample_2 0 5
#3: sample_3 0 5
DF <- structure(list(ID = c("sample_1", "sample_1", "sample_1", "sample_1",
"sample_1", "sample_1", "sample_1", "sample_1", "sample_1", "sample_1",
"sample_1", "sample_1", "sample_2", "sample_2", "sample_2", "sample_2",
"sample_2", "sample_2", "sample_2", "sample_2", "sample_2", "sample_2",
"sample_2", "sample_2", "sample_2", "sample_3", "sample_3", "sample_3",
"sample_3", "sample_3", "sample_3", "sample_3", "sample_3", "sample_3",
"sample_3", "sample_3", "sample_3", "sample_3", "sample_3", "sample_3"
), total = c(1L, 0L, 2L, 1L, 0L, 0L, 2L, 1L, 0L, 2L, 1L, 4L,
2L, 1L, 2L, 0L, 0L, 0L, 1L, 2L, 1L, 4L, 5L, 2L, 1L, 0L, 0L, 1L,
2L, 1L, 0L, 2L, 1L, 4L, 5L, 1L, 1L, 0L, 0L, 1L)), .Names = c("ID",
"total"), row.names = c(NA, -40L), class = "data.frame")