I have a data frame with 10 columns and 2000 rows. My sample data looks like this:
rs_id Code Combination_Ag A.Ag Combination_Bg B.Ag Combination_Cg C.Ag
rs_1 0 1:01/1:01 1 13:02/13:02 1 03:04/03:04 6 1:01/1:01 1
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/15:02 1 1:01/15:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/2:05: 1 13:02/8:01 1 06:02/06:02 3 1:01/4:04 1
rs_1 2 1:01/24:02 3 14:01/14:02 1 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 1:01/31:01 1 15:01/15:01 1 06:02/3:03 1 1:01/7:01 2
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5
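For each rs_id and each Code (0, 1 and 2), I am trying to find the combination with the highest count in each group (A.Ag, B.Ag, C.Ag). How can I achieve this? The expected output would be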
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/03:04 6 1:01/1:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/24:02 3 06:02/06:02 3 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5
Answer 0 (score: 3)
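This approach reshapes the data from wide to long format (melting two measure columns simultaneously) and picks the row with the highest Ag value for each unique combination of rs_id, Code, and variable. Finally, the result is reshaped from long back to wide format, and the columns are reordered to return the expected result: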
library(data.table)
cols <- c("Combination", "Ag")
melt(setDT(DF), measure.vars = patterns("Combination", "[A-D][.]Ag"),
value.name = cols)[
, variable := forcats::lvls_revalue(variable, LETTERS[1:4])][
, .SD[which.max(Ag)], by = .(rs_id, Code, variable)][
, dcast(.SD, rs_id + Code ~ variable, value.var = cols)][
, setcolorder(.SD, c(1:2, as.vector(outer(c(0, 4), 3:6, "+"))))]
   rs_id Code Combination_A Ag_A Combination_B Ag_B Combination_C Ag_C Combination_D Ag_D
1:  rs_1    0    1:01/11:01    2   13:02/49:01    2   03:04/03:04    6     1:01/1:01    1
2:  rs_1    1     1:01/2:01    6   13:02/57:01    1    03:04/7:01    2     1:01/3:01    1
3:  rs_1    2    1:01/24:02    3    13:02/8:01    1   06:02/06:02    3     1:01/4:04    3
4:  rs_2    0     1:01/3:01    1    14:01/7:02    1   06:02/2:02:    1     1:01/4:07    1
5:  rs_2    1    11:01/2:01    4   15:01/15:01    1    06:02/3:03    1     1:01/7:01    2
6:  rs_2    2   11:01/25:01    1   15:01/44:02    2    06:02/4:01    1    10:01/3:01    5
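To make the first two steps of the chain easier to follow, here is a minimal sketch on a cut-down toy table. The object names toy, long, and best are made up for illustration, and only the A and B groups of the first three sample rows are used:

```r
library(data.table)

# Toy data (hypothetical subset): first three sample rows, groups A and B only
toy <- data.table(
  rs_id          = c("rs_1", "rs_1", "rs_1"),
  Code           = c(0L, 0L, 1L),
  Combination_Ag = c("1:01/1:01", "1:01/11:01", "1:01/2:01"),
  A.Ag           = c(1L, 2L, 6L),
  Combination_Bg = c("13:02/13:02", "13:02/49:01", "13:02/57:01"),
  B.Ag           = c(1L, 2L, 1L)
)

# Melt both measure columns at once: each pattern yields one value column,
# and variable is a factor ("1", "2") that indexes the column pairs
long <- melt(toy, measure.vars = patterns("^Combination", "[A-B][.]Ag"),
             value.name = c("Combination", "Ag"))

# Keep, per rs_id / Code / variable, the row with the largest Ag
best <- long[, .SD[which.max(Ag)], by = .(rs_id, Code, variable)]
best
```

Note that which.max() returns the first maximum, so when several rows in a group tie for the highest Ag, only the first of them is kept.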
The OP has asked for an explanation of the last chained data.table expression, setcolorder(.SD, c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))).
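This expression reorders the columns of the result by reference, i.e., without copying. When reshaping with multiple value.var columns, dcast() returns the columns grouped by value.var: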
melt(setDT(DF), measure.vars = patterns("Combination", "[A-D][.]Ag"),
value.name = cols)[
, variable := forcats::lvls_revalue(variable, LETTERS[1:4])][
, .SD[which.max(Ag)], by = .(rs_id, Code, variable)][
, dcast(.SD, rs_id + Code ~ variable, value.var = cols)]
   rs_id Code Combination_A Combination_B Combination_C Combination_D Ag_A Ag_B Ag_C Ag_D
1:  rs_1    0    1:01/11:01   13:02/49:01   03:04/03:04     1:01/1:01    2    2    6    1
2:  rs_1    1     1:01/2:01   13:02/57:01    03:04/7:01     1:01/3:01    6    1    2    1
3:  rs_1    2    1:01/24:02    13:02/8:01   06:02/06:02     1:01/4:04    3    1    3    3
4:  rs_2    0     1:01/3:01    14:01/7:02   06:02/2:02:     1:01/4:07    1    1    1    1
5:  rs_2    1    11:01/2:01   15:01/15:01    06:02/3:03     1:01/7:01    4    1    1    2
6:  rs_2    2   11:01/25:01   15:01/44:02    06:02/4:01    10:01/3:01    1    2    1    5
The OP, however, expects the output to be grouped by variable. So the required column order is c(1, 2, 3, 7, 4, 8, 5, 9, 6, 10).
1 and 2 denote the id.var columns. as.vector(outer(c(0, 4), 3:6, "+")) is just a way to save typing 3, 7, 4, 8, 5, 9, 6, 10.
outer(c(0, 4), 3:6, "+")
     [,1] [,2] [,3] [,4]
[1,]    3    4    5    6
[2,]    7    8    9   10
as.vector(outer(c(0, 4), 3:6, "+"))
[1] 3 7 4 8 5 9 6 10
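As a quick sanity check, the index vector can be applied to the column names alone, without any data. This is only a sketch: nm is a hypothetical character vector that mimics the column names dcast() produced above.

```r
# Column names in the order returned by dcast() (grouped by value.var)
nm <- c("rs_id", "Code",
        paste0("Combination_", LETTERS[1:4]),
        paste0("Ag_", LETTERS[1:4]))

# Index vector from the answer: id columns first, then interleaved pairs
idx <- c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))

# Reordered names: each Combination_x is now followed by its Ag_x
nm[idx]
```

R stores matrices column by column, so flattening the 2 x 4 result of outer() walks down each column in turn, which is what pairs position 3 with 7, 4 with 8, and so on.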
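The code can be simplified further. Because c() itself flattens an array into a vector, just as as.vector() does, the call to as.vector() inside c() is not needed. So, instead of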
c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))
we can write
c(1:2, outer(c(0, 4), 3:6, "+"))
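The equivalence of the two forms can be checked directly. The sketch below also wraps the idea in a small helper for other table shapes. interleave_cols() is a hypothetical name, not part of data.table, and it assumes the id columns come first, followed by n_value blocks of n_groups columns each:

```r
# c() drops the dim attribute of a matrix, so both forms give the same vector
with_as_vector    <- c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))
without_as_vector <- c(1:2, outer(c(0, 4), 3:6, "+"))
identical(with_as_vector, without_as_vector)

# Hypothetical helper: column order that interleaves n_value blocks of
# n_groups columns each, placed after n_id leading id columns
interleave_cols <- function(n_id, n_value, n_groups) {
  offsets <- n_groups * (seq_len(n_value) - 1)  # start offset of each block
  first   <- n_id + seq_len(n_groups)           # positions within the first block
  c(seq_len(n_id), outer(offsets, first, "+"))
}

interleave_cols(n_id = 2, n_value = 2, n_groups = 4)
```

With 2 id columns, 2 value columns, and 4 groups, the helper reproduces the order used above. Passing column names to setcolorder() instead of positions may be the more robust choice if the column layout can change.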
Data

Note that I have filled in the missing column headers for the last two columns.
library(data.table)
DF <- fread(
"rs_id Code Combination_Ag A.Ag Combination_Bg B.Ag Combination_Cg C.Ag Combination_Dg D.Ag
rs_1 0 1:01/1:01 1 13:02/13:02 1 03:04/03:04 6 1:01/1:01 1
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/15:02 1 1:01/15:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/2:05: 1 13:02/8:01 1 06:02/06:02 3 1:01/4:04 1
rs_1 2 1:01/24:02 3 14:01/14:02 1 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 1:01/31:01 1 15:01/15:01 1 06:02/3:03 1 1:01/7:01 2
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5"
)