I have a question about the following data.table:
library(data.table)
DT <- data.table(
A = c(rep("aa",2),rep("bb",2),rep("aa",2)),
B = c(rep("H",2),rep("Na",2),rep("H",2)),
C = c(1,1,1,1,1,2),
Conc = c(1.5,5,5,10,10,10),
Area =c(100.25,500,1089,6000.02,1200,10.564),
Area_UT =c(90.54,488,1010,5999,1099,8)
)
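I want to first group that data.table by A. Then I want to compare each row in a group with the other rows in the same group by computing ratios of the variables Conc, Area_T and Area_UT (Area_T is the column named Area in DT above).
One of the resulting groups would therefore be: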
DT_sub1 <- data.table(
A = c("aa","aa","aa","aa"),
B = c("H","H","H","H"),
C = c(1,1,1,2),
Conc = c(1.5,5,10,10),
Area_T = c(100.25,500,1200,10.564),
Area_UT = c(90.54,488,1099,8)
)
And another one:
DT_sub2 <- data.table(
A = c("bb","bb"),
B = c("Na","Na"),
C = c(1,1),
Conc = c(5,10),
Area_T = c(1089,6000.02),
Area_UT = c(1010,5999)
)
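(I don't actually need the subgroups DT_sub1 and DT_sub2 as output; I only show them to make clear how I want to group.)
Then I want to compare each row of a group with the other rows of the same group by computing all possible ratios of the variables Conc, Area_T and Area_UT. Done naively, each ratio would be formed twice (lower number / higher number and higher number / lower number). So I only want to compare two rows of a subgroup when Area_T1 / Area_T2 < 1. In the output, the two compared rows should be "copied" next to each other, followed by the computed ratios as new variables. For the data.table DT, the output should therefore look like this: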
DT_output <- data.table(
A_1 = c("aa","aa","aa","aa","aa","aa","bb"),
B_1 = c("H","H","H","H","H","H","Na"),
C_1 = c(1,1,2,1,2,2,1),
Conc_1 = c(1.5,1.5,10,5,10,10,5),
Area_T_1 = c(100.25,100.25,10.564,500,10.564,10.564,1089),
Area_UT_1 = c(90.54,90.54,8,488,8,8,1010),
A_2 = c("aa","aa","aa","aa","aa","aa","bb"),
B_2 = c("H","H","H","H","H","H","Na"),
C_2 = c(1,1,1,1,1,1,1),
Conc_2 = c(5,10,1.5,10,5,10,10),
Area_T_2 = c(500,1200,100.25,1200,500,1200,6000.02),
Area_UT_2 = c(488,1099,90.54,1099,488,1099,5999),
R_Conc = c(0.3,0.15,6.66666666666667,0.5,2,1,0.5),
R_Area_T = c(0.2005,0.0835416666666667,0.105376558603491,0.416666666666667,0.021128,0.00880333333333333,0.181499395002017),
R_Area_UT = c(0.185532786885246,0.0823839854413103,0.0883587364700685,0.444040036396724,0.0163934426229508,0.00727934485896269,0.168361393565594)
)
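To make the first row of the expected output concrete, here is a minimal base R sketch of the arithmetic for a single pair. The names row_1, row_2, keep and ratios are only illustrative; the numbers are the first two "aa" rows of DT.

```r
# Values taken from the first two "aa" rows of DT
row_1 <- c(Conc = 1.5, Area_T = 100.25, Area_UT = 90.54)
row_2 <- c(Conc = 5,   Area_T = 500,    Area_UT = 488)

# Keep the pair only in the direction where the Area_T ratio is below 1
keep <- row_1[["Area_T"]] / row_2[["Area_T"]] < 1

# Element-wise ratios of row 1 over row 2
ratios <- row_1 / row_2
names(ratios) <- paste0("R_", names(ratios))
ratios
```

Swapping row_1 and row_2 would give an Area_T ratio above 1, so that direction is the one I want to discard.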
Is there a way to solve this in base R, data.table, or dplyr?
Thanks a lot!
Yasel
Answer 0 (score: 2)
A non-equi join in data.table is well suited to this problem:
#copy Area column to be used as joining key as it will be overwritten after join
DT[, Aid := Area]
newcols <- c("R_Conc","R_Area_T","R_Area_UT")
#perform non-equi inner join
output <- DT[DT, on=.(A, Aid<Aid), nomatch=0L][,
#calculate ratios and update by reference
(newcols) := .(Conc/i.Conc, Area/i.Area, Area_UT/i.Area_UT)]
setorder(output, A, Conc, Area)[]
Output:
A B C Conc Area Area_UT Aid i.B i.C i.Conc i.Area i.Area_UT R_Conc R_Area_T R_Area_UT
1: aa H 1 1.5 100.250 90.54 500.00 H 1 5.0 500.00 488.00 0.300000 0.200500000 0.185532787
2: aa H 1 1.5 100.250 90.54 1200.00 H 1 10.0 1200.00 1099.00 0.150000 0.083541667 0.082383985
3: aa H 1 5.0 500.000 488.00 1200.00 H 1 10.0 1200.00 1099.00 0.500000 0.416666667 0.444040036
4: aa H 2 10.0 10.564 8.00 100.25 H 1 1.5 100.25 90.54 6.666667 0.105376559 0.088358736
5: aa H 2 10.0 10.564 8.00 500.00 H 1 5.0 500.00 488.00 2.000000 0.021128000 0.016393443
6: aa H 2 10.0 10.564 8.00 1200.00 H 1 10.0 1200.00 1099.00 1.000000 0.008803333 0.007279345
7: bb Na 1 5.0 1089.000 1010.00 6000.02 Na 1 10.0 6000.02 5999.00 0.500000 0.181499395 0.168361394
Compare with setorder(DT_output, A_1, Conc_1, Area_T_1)[]:
A_1 B_1 C_1 Conc_1 Area_T_1 Area_UT_1 A_2 B_2 C_2 Conc_2 Area_T_2 Area_UT_2 R_Conc R_Area_T R_Area_UT
1: aa H 1 1.5 100.250 90.54 aa H 1 5.0 500.00 488.00 0.300000 0.200500000 0.185532787
2: aa H 1 1.5 100.250 90.54 aa H 1 10.0 1200.00 1099.00 0.150000 0.083541667 0.082383985
3: aa H 1 5.0 500.000 488.00 aa H 1 10.0 1200.00 1099.00 0.500000 0.416666667 0.444040036
4: aa H 2 10.0 10.564 8.00 aa H 1 1.5 100.25 90.54 6.666667 0.105376559 0.088358736
5: aa H 2 10.0 10.564 8.00 aa H 1 5.0 500.00 488.00 2.000000 0.021128000 0.016393443
6: aa H 2 10.0 10.564 8.00 aa H 1 10.0 1200.00 1099.00 1.000000 0.008803333 0.007279345
7: bb Na 1 5.0 1089.000 1010.00 bb Na 1 10.0 6000.02 5999.00 0.500000 0.181499395 0.168361394
If required, you can update the column names of output with data.table::setnames. Basically, the i. prefix corresponds to your _2 suffix.
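As a rough sketch of that renaming step, the block below rebuilds the join and then maps the unprefixed columns to _1 and the i. columns to _2. The vectors old and new are illustrative names, and dropping the helper column Aid is an assumption about what the final table should contain.

```r
library(data.table)

DT <- data.table(
  A = c(rep("aa", 2), rep("bb", 2), rep("aa", 2)),
  B = c(rep("H", 2), rep("Na", 2), rep("H", 2)),
  C = c(1, 1, 1, 1, 1, 2),
  Conc = c(1.5, 5, 5, 10, 10, 10),
  Area = c(100.25, 500, 1089, 6000.02, 1200, 10.564),
  Area_UT = c(90.54, 488, 1010, 5999, 1099, 8)
)

# Same non-equi self-join as above
DT[, Aid := Area]
output <- DT[DT, on = .(A, Aid < Aid), nomatch = 0L]
output[, c("R_Conc", "R_Area_T", "R_Area_UT") :=
         .(Conc / i.Conc, Area / i.Area, Area_UT / i.Area_UT)]

# Assumption: the helper join column is not wanted in the result
output[, Aid := NULL]

# Unprefixed columns become _1, "i." columns become _2
old <- c("B", "C", "Conc", "Area", "Area_UT",
         "i.B", "i.C", "i.Conc", "i.Area", "i.Area_UT")
new <- c("B_1", "C_1", "Conc_1", "Area_T_1", "Area_UT_1",
         "B_2", "C_2", "Conc_2", "Area_T_2", "Area_UT_2")
setnames(output, old, new)
```

The grouping column A appears only once because it is an equality join key; if separate A_1 and A_2 columns are needed, they would have to be added by hand.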
Answer 1 (score: 1)
An alternative to doing a Cartesian merge and then filtering is to use unique combinations:
newcols <- c("R_Conc","R_Area_T","R_Area_UT")
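res <- DT[, {
  # all unique pairs of row positions within the current group
  unique_comb <- combn(seq_len(.N), 2)
  # bind both rows of each pair side by side; check.names adds the ".1" suffix
  data.table(.SD[unique_comb[1, ]], .SD[unique_comb[2, ]], check.names = TRUE)[
    , (newcols) := .(Conc / Conc.1, Area / Area.1, Area_UT / Area_UT.1)]
}, by = A]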
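Note that combn pairs rows in the order they appear in each group, so the Area ratio is not guaranteed to be below 1 (for example, 100.25 / 10.564 in group "aa"). One possible adjustment, shown here only as a sketch, is to sort by Area within each group first so that the first row of every pair has the smaller Area. It assumes every group has at least two rows.

```r
library(data.table)

DT <- data.table(
  A = c(rep("aa", 2), rep("bb", 2), rep("aa", 2)),
  B = c(rep("H", 2), rep("Na", 2), rep("H", 2)),
  C = c(1, 1, 1, 1, 1, 2),
  Conc = c(1.5, 5, 5, 10, 10, 10),
  Area = c(100.25, 500, 1089, 6000.02, 1200, 10.564),
  Area_UT = c(90.54, 488, 1010, 5999, 1099, 8)
)

newcols <- c("R_Conc", "R_Area_T", "R_Area_UT")

# Sort by Area within each group so pair (i, j) with i < j has Area_i <= Area_j
res <- DT[order(A, Area)][, {
  idx <- combn(seq_len(.N), 2)
  data.table(.SD[idx[1, ]], .SD[idx[2, ]], check.names = TRUE)
}, by = A]

# Ratios of the first row of each pair over the second
res[, (newcols) := .(Conc / Conc.1, Area / Area.1, Area_UT / Area_UT.1)]
```

Unlike the non-equi join, this keeps pairs with identical Area values (ratio exactly 1); the sample data has no such ties.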