Complex SUM function by grouping on two variables within a data frame in R

Time: 2017-08-21 09:32:46

Tags: r

For the following data frame entries:

DF <- data.frame(Major=c("MATH","MATH","MATH","MLSP","MLSP","MLSP","BIOL","BIOL","BIOL","PSHY","PSHY","PSHY"), Age=(c(2,3,4,2,3,4,2,3,4,2,3,4)), 
                 MJR_1=c("BIOL","PSHY","AGBU","MATH","PSHY",0,"MLSP","MATH",0,0,"MATH","MLSP"), TRF_MJR_1=(c(7,2,2,3,2,0,3,2,0,0,2,2)),
                 MJR_2=c("PSHY","BIOL",0,"BIOL","MATH",0,"MATH","PSHY",0,0,"MLSP","BIOL"), TRF_MJR_2=(c(3,1,0,2,1,0,2,4,0,0,1,2)),
                 MJR_3=c(0,0,0,0,"BIOL",0,0,0,0,0,0,0), TRF_MJR_3=(c(0,0,0,0,1,0,0,0,0,0,0,0)))

we get the following data frame:

   Major  Age MJR_1 TRF_MJR_1 MJR_2 TRF_MJR_2 MJR_3 TRF_MJR_3
1   MATH   2  BIOL         7  PSHY         3     0         0
2   MATH   3  PSHY         2  BIOL         1     0         0
3   MATH   4  AGBU         2     0         0     0         0
4   MLSP   2  MATH         3  BIOL         2     0         0
5   MLSP   3  PSHY         2  MATH         1  BIOL         1
6   MLSP   4     0         0     0         0     0         0
7   BIOL   2  MLSP         3  MATH         2     0         0
8   BIOL   3  MATH         2  PSHY         4     0         0
9   BIOL   4     0         0     0         0     0         0
10  PSHY   2     0         0     0         0     0         0
11  PSHY   3  MATH         2  MLSP         1     0         0
12  PSHY   4  MLSP         2  BIOL         2     0         0

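I need to produce the output table below. It should have a SUM column named "TRF_IN - Transferred IN" per major, which adds the counts for each transferred major (TRF_MJR_1, TRF_MJR_2, etc.) to the matching group defined by the first two columns (Major and Age). Note that the "Major" category is taken from MJR_1, MJR_2, etc., as shown below.

I would appreciate any help that avoids multiple "merge" or "ddply" calls, because the actual file is large and has many variables.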

   Major Age TRF_IN_SUM
 1:  MATH   2          5
 2:  MATH   3          5
 3:  MATH   4          0
 4:  MLSP   2          3
 5:  MLSP   3          1
 6:  MLSP   4          2
 7:  BIOL   2          9
 8:  BIOL   3          2
 9:  BIOL   4          2
10:  PSHY   2          3
11:  PSHY   3          8
12:  PSHY   4          0    
**13:  AGBU   4          2**

Explanation of the output table:

Row1: Math major with Age 2:  
TRF_IN = "3" from TRF_MJR_1 in Row(4) having MJR_1= Math and Age =2 
+ 
TRF_IN = "2" from  TRF_MJR_2 in Row(7) having MJR_2= Math and Age =2

Row2: Math major with Age 3 :  
TRF_IN = "1" from TRF_MJR_2 in Row(5) having MJR_2= Math and Age =3 
+ 
TRF_IN = "2" from  TRF_MJR_1 in Row(8) having MJR_1= Math and Age =3
+
TRF_IN = "2" from  TRF_MJR_1 in Row(11) having MJR_1= Math and Age =3

1 answer:

Answer 0 (score: 1)

The melt() function from the data.table package can reshape multiple measure columns simultaneously, which is what is required here.
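To illustrate what reshaping multiple measure columns means, here is a minimal sketch on a made-up two-row table (the name `toy` and its values are only for illustration, they are not the OP's data). Each regular expression passed to patterns() collects one group of columns, and value.name labels the resulting value columns.

```r
library(data.table)

# hypothetical toy table, not the OP's DF
toy <- data.table(
  id    = 1:2,
  MJR_1 = c("BIOL", "MATH"), TRF_MJR_1 = c(7, 3),
  MJR_2 = c("PSHY", "BIOL"), TRF_MJR_2 = c(3, 2)
)

# one pattern per group of measure columns:
# "^MJR_" matches MJR_1, MJR_2; "^TRF_MJR_" matches TRF_MJR_1, TRF_MJR_2
long <- melt(toy, id.vars = "id",
             measure.vars = patterns("^MJR_", "^TRF_MJR_"),
             value.name = c("MJR", "TRF"))

# long has one row per id and column pair, with columns id, variable, MJR, TRF
long
```

Without value.name, the value columns get the default names value1 and value2, which is what Variant 1 relies on.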

Variant 1:

library(data.table)
# reshape from wide to long format
melt(setDT(DF), id.vars = c("Major", "Age"), 
     measure.vars = patterns("^MJR_", "^TRF_MJR_"))[
       # omit null entries
       value1 != "0" & value2 != 0L][
         # aggregate
         , .(TRF_IN_SUM = sum(value2)), 
         keyby = .(Major = value1, Age)][
           # right join with first two columns of wide data set
           DF[, 1:2], on = c("Major", "Age")][
             # replace NA by 0
             is.na(TRF_IN_SUM), TRF_IN_SUM := 0L][]
    Major Age TRF_IN_SUM
 1:  MATH   2          5
 2:  MATH   3          5
 3:  MATH   4          0
 4:  MLSP   2          3
 5:  MLSP   3          1
 6:  MLSP   4          2
 7:  BIOL   2          9
 8:  BIOL   3          2
 9:  BIOL   4          2
10:  PSHY   2          3
11:  PSHY   3          8
12:  PSHY   4          0

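Edit 1: By right joining with the first two columns of the original (wide) data, the result has the same number and order of rows as the input. NAs that indicate missing data are replaced by 0.

Caveat: As the OP pointed out, any value in the MJR columns that does not also occur in Major will not appear in the result, e.g., AGBU. So this variant is not recommended.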
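The caveat can be checked up front. The following is only a sketch: it reproduces just the name columns of the question's DF, and the variable names (`mjr_cols`, `mjr_values`, `missing_majors`) are made up for illustration.

```r
# name columns copied from the question's DF; TRF columns are not needed here
DF <- data.frame(
  Major = c("MATH","MATH","MATH","MLSP","MLSP","MLSP","BIOL","BIOL","BIOL","PSHY","PSHY","PSHY"),
  MJR_1 = c("BIOL","PSHY","AGBU","MATH","PSHY",0,"MLSP","MATH",0,0,"MATH","MLSP"),
  MJR_2 = c("PSHY","BIOL",0,"BIOL","MATH",0,"MATH","PSHY",0,0,"MLSP","BIOL"),
  MJR_3 = c(0,0,0,0,"BIOL",0,0,0,0,0,0,0),
  stringsAsFactors = FALSE
)

# collect all distinct values found in the MJR_* columns
mjr_cols   <- grep("^MJR_", names(DF), value = TRUE)
mjr_values <- unique(unlist(lapply(mjr_cols, function(x) as.character(DF[[x]]))))

# drop the "0" placeholder, then keep values that never occur in Major
missing_majors <- setdiff(setdiff(mjr_values, "0"), as.character(DF$Major))
missing_majors
```

Any value listed in `missing_majors` would be lost by the right join in Variant 1.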

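Variant 2:

This uses melt() as before, together with Frank's approach using CJ(), but enhances it by using factors to maintain the given order of the Major column, which "prettifies" the result. Note that the handy forcats package is used.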

library(data.table)
library(forcats)
setDT(DF)[
  # make sure factor levels are in order of occurrence
  , Major := fct_inorder(Major)][
    # reshape wide to long with multiple measures columns 
    , melt(.SD, measure.vars = patterns("^MJR", "^TRF"), 
           value.name = c("MJR", "TRF"))][
             # omit null entries
             MJR != "0"][
               # unify factor levels with levels of Major in lead
               , c("Major", "MJR") := fct_unify(.(Major, factor(MJR)))][
                 # use cross join to create all combinations of MJR and Age,
                 # right join with results; the CJ() columns are named explicitly
                 # because newer data.table versions no longer call them V1, V2
                 CJ(MJR = MJR, Age = Age, unique = TRUE), on = .(MJR, Age), 
                 # aggregate by join parameters
                 .(TRF_IN_SUM = sum(TRF, na.rm = TRUE)), by = .EACHI]
     MJR Age TRF_IN_SUM
 1: MATH   2          5
 2: MATH   3          5
 3: MATH   4          0
 4: MLSP   2          3
 5: MLSP   3          1
 6: MLSP   4          2
 7: BIOL   2          9
 8: BIOL   3          2
 9: BIOL   4          2
10: PSHY   2          3
11: PSHY   3          8
12: PSHY   4          0
13: AGBU   2          0
14: AGBU   3          0
15: AGBU   4          2

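Now the result includes AGBU, shows all combinations of MJR and Age, and preserves the original order of Major.

However, this may still not be perfect if there are entries in Major that do not appear in any of the MJR columns. To cover this case, a full join, i.e., merge() with all = TRUE, is more appropriate.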
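A rough sketch of what such a full join could look like, using the DF from the question. This is an illustration under stated assumptions, not tested against the OP's real file: the names `long`, `sums`, `groups` and `res` are made up, and stringsAsFactors = FALSE is added so the name columns stay character.

```r
library(data.table)

# DF as given in the question, with strings kept as character
DF <- data.frame(
  Major = c("MATH","MATH","MATH","MLSP","MLSP","MLSP","BIOL","BIOL","BIOL","PSHY","PSHY","PSHY"),
  Age = c(2,3,4,2,3,4,2,3,4,2,3,4),
  MJR_1 = c("BIOL","PSHY","AGBU","MATH","PSHY",0,"MLSP","MATH",0,0,"MATH","MLSP"),
  TRF_MJR_1 = c(7,2,2,3,2,0,3,2,0,0,2,2),
  MJR_2 = c("PSHY","BIOL",0,"BIOL","MATH",0,"MATH","PSHY",0,0,"MLSP","BIOL"),
  TRF_MJR_2 = c(3,1,0,2,1,0,2,4,0,0,1,2),
  MJR_3 = c(0,0,0,0,"BIOL",0,0,0,0,0,0,0),
  TRF_MJR_3 = c(0,0,0,0,1,0,0,0,0,0,0,0),
  stringsAsFactors = FALSE
)
DT <- as.data.table(DF)

# reshape wide to long with two groups of measure columns
long <- melt(DT, id.vars = c("Major", "Age"),
             measure.vars = patterns("^MJR_", "^TRF_MJR_"),
             value.name = c("MJR", "TRF"))

# sum transfers per receiving major (MJR) and Age, skipping "0" placeholders
sums <- long[MJR != "0", .(TRF_IN_SUM = sum(TRF)), by = .(Major = MJR, Age)]

# all Major/Age groups present in the original data
groups <- unique(DT[, .(Major, Age)])

# full join keeps groups found on either side; unmatched sums become NA
res <- merge(groups, sums, by = c("Major", "Age"), all = TRUE)
res[is.na(TRF_IN_SUM), TRF_IN_SUM := 0]
res
```

Note that merge() sorts the result by the join columns, so the original order of Major is not preserved here. If that order matters, converting Major to a factor as in Variant 2 should bring it back.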