Iterative calculations within subsets of a data.table

Date: 2018-12-03 20:22:57

Tags: r data.table

I have a question about the following data.table:

DT <- data.table(
  A = c(rep("aa",2),rep("bb",2),rep("aa",2)),
  B = c(rep("H",2),rep("Na",2),rep("H",2)),
  C = c(1,1,1,1,1,2),
  Conc = c(1.5,5,5,10,10,10),
  Area =c(100.25,500,1089,6000.02,1200,10.564),
  Area_UT =c(90.54,488,1010,5999,1099,8)
)

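I would like to first group that data.table by A. Then, within each group, I want to compare every row with the other rows of the same group by computing ratios of the variables Conc, Area_T (the column named Area in DT above) and Area_UT.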

So one of the resulting groups would be:

DT_sub1 <- data.table(
  A = c("aa","aa","aa","aa"),
  B = c("H","H","H","H"),
  C = c(1,1,1,2),
  Conc = c(1.5,5,10,10),
  Area_T = c(100.25,500,1200,10.564),
  Area_UT = c(90.54,488,1099,8)
)

And the other:

DT_sub2 <- data.table(
  A = c("bb","bb"),
  B = c("Na","Na"),
  C = c(1,1),
  Conc = c(5,10),
  Area_T = c(1089,6000.02),
  Area_UT = c(1010,5999)
)

(I don't actually need these subgroups (DT_sub1, DT_sub2) as output; I only show them to make clear how I want to group.)

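Next, I would like to compare each row of a group with the other rows of the same group by computing all possible ratios of Conc, Area_T and Area_UT. Doing so, however, forms every ratio twice (lower number / higher number and higher number / lower number). I therefore only want to compare two rows of a subgroup when Area_T1 / Area_T2 < 1.

In the output, the two compared rows should be "copied" side by side, followed by the variables holding the computed ratios. So for the data.table DT, the output should look like this: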

DT_output <- data.table(
  A_1 = c("aa","aa","aa","aa","aa","aa","bb"),
  B_1 = c("H","H","H","H","H","H","Na"),
  C_1 = c(1,1,2,1,2,2,1),
  Conc_1 = c(1.5,1.5,10,5,10,10,5),
  Area_T_1 = c(100.25,100.25,10.564,500,10.564,10.564,1089),
  Area_UT_1 = c(90.54,90.54,8,488,8,8,1010),
  A_2 = c("aa","aa","aa","aa","aa","aa","bb"),
  B_2 = c("H","H","H","H","H","H","Na"),
  C_2 = c(1,1,1,1,1,1,1),
  Conc_2 = c(5,10,1.5,10,5,10,10),
  Area_T_2 = c(500,1200,100.25,1200,500,1200,6000.02),
  Area_UT_2 = c(488,1099,90.54,1099,488,1099,5999),
  R_Conc = c(0.3,0.15,6.66666666666667,0.5,2,1,0.5),
  R_Area_T = c(0.2005,0.0835416666666667,0.105376558603491,0.416666666666667,0.021128,0.00880333333333333,0.181499395002017),
  R_Area_UT = c(0.185532786885246,0.0823839854413103,0.0883587364700685,0.444040036396724,0.0163934426229508,0.00727934485896269,0.168361393565594)
)
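
To make the ratio definition concrete, here is a minimal sketch that recomputes the ratios for the first expected row by hand. The names row_1, row_2 and ratios are only illustrative; the values are taken from the first entries of the _1 and _2 columns above.

```r
# First compared pair from DT_output: row "_1" over row "_2"
row_1 <- c(Conc = 1.5, Area_T = 100.25, Area_UT = 90.54)
row_2 <- c(Conc = 5,   Area_T = 500,    Area_UT = 488)

# Element-wise division gives R_Conc, R_Area_T and R_Area_UT for this pair
ratios <- row_1 / row_2
print(ratios)
```

The pair is kept in this orientation because 100.25 / 500 is below 1; the reversed pair (500 / 100.25) would be dropped.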

Is there a way to solve this in base R, data.table, or dplyr?

Thanks a lot!

Yasel

2 Answers:

Answer 0: (score: 2)

A non-equi join is a good fit for this problem:

#copy Area column to be used as joining key as it will be overwritten after join
DT[, Aid := Area]

newcols <- c("R_Conc","R_Area_T","R_Area_UT")

#perform non-equi inner join
output <- DT[DT, on=.(A, Aid<Aid), nomatch=0L][, 
    #calculate ratios and update by reference
    (newcols) := .(Conc/i.Conc, Area/i.Area, Area_UT/i.Area_UT)]

The output of setorder(output, A, Conc, Area)[] is:

    A  B C Conc     Area Area_UT     Aid i.B i.C i.Conc  i.Area i.Area_UT   R_Conc    R_Area_T   R_Area_UT
1: aa  H 1  1.5  100.250   90.54  500.00   H   1    5.0  500.00    488.00 0.300000 0.200500000 0.185532787
2: aa  H 1  1.5  100.250   90.54 1200.00   H   1   10.0 1200.00   1099.00 0.150000 0.083541667 0.082383985
3: aa  H 1  5.0  500.000  488.00 1200.00   H   1   10.0 1200.00   1099.00 0.500000 0.416666667 0.444040036
4: aa  H 2 10.0   10.564    8.00  100.25   H   1    1.5  100.25     90.54 6.666667 0.105376559 0.088358736
5: aa  H 2 10.0   10.564    8.00  500.00   H   1    5.0  500.00    488.00 2.000000 0.021128000 0.016393443
6: aa  H 2 10.0   10.564    8.00 1200.00   H   1   10.0 1200.00   1099.00 1.000000 0.008803333 0.007279345
7: bb Na 1  5.0 1089.000 1010.00 6000.02  Na   1   10.0 6000.02   5999.00 0.500000 0.181499395 0.168361394

Compare this with setorder(DT_output, A_1, Conc_1, Area_T_1)[]:

   A_1 B_1 C_1 Conc_1 Area_T_1 Area_UT_1 A_2 B_2 C_2 Conc_2 Area_T_2 Area_UT_2   R_Conc    R_Area_T   R_Area_UT
1:  aa   H   1    1.5  100.250     90.54  aa   H   1    5.0   500.00    488.00 0.300000 0.200500000 0.185532787
2:  aa   H   1    1.5  100.250     90.54  aa   H   1   10.0  1200.00   1099.00 0.150000 0.083541667 0.082383985
3:  aa   H   1    5.0  500.000    488.00  aa   H   1   10.0  1200.00   1099.00 0.500000 0.416666667 0.444040036
4:  aa   H   2   10.0   10.564      8.00  aa   H   1    1.5   100.25     90.54 6.666667 0.105376559 0.088358736
5:  aa   H   2   10.0   10.564      8.00  aa   H   1    5.0   500.00    488.00 2.000000 0.021128000 0.016393443
6:  aa   H   2   10.0   10.564      8.00  aa   H   1   10.0  1200.00   1099.00 1.000000 0.008803333 0.007279345
7:  bb  Na   1    5.0 1089.000   1010.00  bb  Na   1   10.0  6000.02   5999.00 0.500000 0.181499395 0.168361394

If required, you can rename the columns of output with data.table::setnames. Basically, the i. prefix corresponds to your _2 suffix.
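
As a rough sketch of that renaming step, the code below rebuilds the joined table and maps its columns onto the names used in DT_output. The choice to drop the helper column Aid and to add A_2 as a copy of A (the join is on A, so both rows of a pair share it) is an assumption about the desired layout, not part of the original answer. The vector wanted is a hypothetical helper holding the target column order.

```r
library(data.table)

DT <- data.table(
  A = c(rep("aa", 2), rep("bb", 2), rep("aa", 2)),
  B = c(rep("H", 2), rep("Na", 2), rep("H", 2)),
  C = c(1, 1, 1, 1, 1, 2),
  Conc = c(1.5, 5, 5, 10, 10, 10),
  Area = c(100.25, 500, 1089, 6000.02, 1200, 10.564),
  Area_UT = c(90.54, 488, 1010, 5999, 1099, 8)
)

# Same non-equi self-join as in the answer
DT[, Aid := Area]
output <- DT[DT, on = .(A, Aid < Aid), nomatch = 0L]
output[, c("R_Conc", "R_Area_T", "R_Area_UT") :=
         .(Conc / i.Conc, Area / i.Area, Area_UT / i.Area_UT)]

# Assumption: the helper key is no longer needed, and A_2 mirrors A
output[, Aid := NULL]
output[, A_2 := A]

# Plain columns become _1, i.-prefixed columns become _2
setnames(output,
         old = c("A", "B", "C", "Conc", "Area", "Area_UT",
                 "i.B", "i.C", "i.Conc", "i.Area", "i.Area_UT"),
         new = c("A_1", "B_1", "C_1", "Conc_1", "Area_T_1", "Area_UT_1",
                 "B_2", "C_2", "Conc_2", "Area_T_2", "Area_UT_2"))

# Reorder to match the layout of DT_output
wanted <- c("A_1", "B_1", "C_1", "Conc_1", "Area_T_1", "Area_UT_1",
            "A_2", "B_2", "C_2", "Conc_2", "Area_T_2", "Area_UT_2",
            "R_Conc", "R_Area_T", "R_Area_UT")
setcolorder(output, wanted)
```

Both setnames and setcolorder modify output by reference, so no reassignment is needed.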

Answer 1: (score: 1)

An alternative to doing a Cartesian merge and then filtering is to use the unique combinations of rows:

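newcols <- c("R_Conc", "R_Area_T", "R_Area_UT")
res <- DT[, {
  # all unique pairs of row positions within the current group
  unique_comb <- combn(.SD[, .I], 2)
  # bind the two rows of each pair side by side; check.names = TRUE
  # suffixes the duplicated names of the second row with ".1"
  # note: pairs follow row order, so the ratios are not forced below 1
  data.table(.SD[unique_comb[1, ]], .SD[unique_comb[2, ]], check.names = TRUE)[
    , (newcols) := .(Conc / Conc.1, Area / Area.1, Area_UT / Area_UT.1)]
}, by = A]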
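
Because combn pairs rows in their stored order, the Area ratio from the snippet above can exceed 1 (for example 100.25 / 10.564 in group aa). One possible variation, sketched here as an assumption rather than as the answerer's method, sorts each group by Area before forming the pairs, so that the first row of every pair has the smaller Area. The names grp and idx are illustrative. Note that combn fails for a group with a single row, and tied Area values would give a ratio of exactly 1.

```r
library(data.table)

DT <- data.table(
  A = c(rep("aa", 2), rep("bb", 2), rep("aa", 2)),
  B = c(rep("H", 2), rep("Na", 2), rep("H", 2)),
  C = c(1, 1, 1, 1, 1, 2),
  Conc = c(1.5, 5, 5, 10, 10, 10),
  Area = c(100.25, 500, 1089, 6000.02, 1200, 10.564),
  Area_UT = c(90.54, 488, 1010, 5999, 1099, 8)
)

newcols <- c("R_Conc", "R_Area_T", "R_Area_UT")
res <- DT[, {
  # Sort the group so that lower Area always comes first in a pair
  grp <- .SD[order(Area)]
  # Each column of idx is one unique pair of row positions (first < second)
  idx <- combn(nrow(grp), 2)
  # Second row's columns get a ".1" suffix through check.names = TRUE
  data.table(grp[idx[1, ]], grp[idx[2, ]], check.names = TRUE)[
    , (newcols) := .(Conc / Conc.1, Area / Area.1, Area_UT / Area_UT.1)]
}, by = A]
```

With the sample data this yields the same seven pairs as the non-equi join, each with an Area ratio below 1.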