For the following data frame entries:
DF <- data.frame(Major=c("MATH","MATH","MATH","MLSP","MLSP","MLSP","BIOL","BIOL","BIOL","PSHY","PSHY","PSHY"), Age=(c(2,3,4,2,3,4,2,3,4,2,3,4)),
MJR_1=c("BIOL","PSHY","AGBU","MATH","PSHY",0,"MLSP","MATH",0,0,"MATH","MLSP"), TRF_MJR_1=(c(7,2,2,3,2,0,3,2,0,0,2,2)),
MJR_2=c("PSHY","BIOL",0,"BIOL","MATH",0,"MATH","PSHY",0,0,"MLSP","BIOL"), TRF_MJR_2=(c(3,1,0,2,1,0,2,4,0,0,1,2)),
MJR_3=c(0,0,0,0,"BIOL",0,0,0,0,0,0,0), TRF_MJR_3=(c(0,0,0,0,1,0,0,0,0,0,0,0)))
We get the following data frame:
   Major Age MJR_1 TRF_MJR_1 MJR_2 TRF_MJR_2 MJR_3 TRF_MJR_3
1   MATH   2  BIOL         7  PSHY         3     0         0
2   MATH   3  PSHY         2  BIOL         1     0         0
3   MATH   4  AGBU         2     0         0     0         0
4   MLSP   2  MATH         3  BIOL         2     0         0
5   MLSP   3  PSHY         2  MATH         1  BIOL         1
6   MLSP   4     0         0     0         0     0         0
7   BIOL   2  MLSP         3  MATH         2     0         0
8   BIOL   3  MATH         2  PSHY         4     0         0
9   BIOL   4     0         0     0         0     0         0
10  PSHY   2     0         0     0         0     0         0
11  PSHY   3  MATH         2  MLSP         1     0         0
12  PSHY   4  MLSP         2  BIOL         2     0         0
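I need the output table below to contain a SUM column called "TRF_IN" (Transferred IN) for each major. It should add the counts for each transferred major (TRF_MJR_1, TRF_MJR_2, etc.) to the appropriate group defined by the first two columns (Major and Age), where the "Major" category is taken from MJR_1, MJR_2, etc., as shown below.
I would appreciate any help that avoids multiple "merge" or "ddply" calls, since the actual file is big and has many variables.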
Major Age TRF_IN_SUM
1: MATH 2 5
2: MATH 3 5
3: MATH 4 0
4: MLSP 2 3
5: MLSP 3 1
6: MLSP 4 2
7: BIOL 2 9
8: BIOL 3 2
9: BIOL 4 2
10: PSHY 2 3
11: PSHY 3 8
12: PSHY 4 0
**13: AGBU 4 2**
Explanation of the output table:
Row1: Math major with Age 2:
TRF_IN = "3" from TRF_MJR_1 in Row(4) having MJR_1= Math and Age =2
+
TRF_IN = "2" from TRF_MJR_2 in Row(7) having MJR_2= Math and Age =2
Row2: Math major with Age 3 :
TRF_IN = "1" from TRF_MJR_2 in Row(5) having MJR_2= Math and Age =3
+
TRF_IN = "2" from TRF_MJR_1 in Row(8) having MJR_1= Math and Age =3
+
TRF_IN = "2" from TRF_MJR_1 in Row(11) having MJR_1= Math and Age =3
Answer 0 (score: 1)
The melt() function from the data.table package can reshape multiple measure columns simultaneously, which is what is required here.
library(data.table)
# reshape from wide to long format
melt(setDT(DF), id.vars = c("Major", "Age"),
measure.vars = patterns("^MJR_", "^TRF_MJR_"))[
# omit null entries
value1 != "0" & value2 != 0L][
# aggregate
, .(TRF_IN_SUM = sum(value2)),
keyby = .(Major = value1, Age)][
# right join with first two columns of wide data set
DF[, 1:2], on = c("Major", "Age")][
# replace NA by 0
is.na(TRF_IN_SUM), TRF_IN_SUM := 0L][]
    Major Age TRF_IN_SUM
 1:  MATH   2          5
 2:  MATH   3          5
 3:  MATH   4          0
 4:  MLSP   2          3
 5:  MLSP   3          1
 6:  MLSP   4          2
 7:  BIOL   2          9
 8:  BIOL   3          2
 9:  BIOL   4          2
10:  PSHY   2          3
11:  PSHY   3          8
12:  PSHY   4          0
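To make the reshaping step easier to follow in isolation, here is a minimal sketch on a small made-up table (the object `wide` and its values are hypothetical, not the OP's data). It only shows what melt() with two patterns returns before any filtering or aggregation.

```r
library(data.table)

# hypothetical toy data with two MJR_/TRF_MJR_ column pairs
wide <- data.table(
  Major     = c("MATH", "BIOL"),
  Age       = c(2, 3),
  MJR_1     = c("BIOL", "MATH"),
  TRF_MJR_1 = c(7, 2),
  MJR_2     = c("PSHY", "0"),
  TRF_MJR_2 = c(3, 0)
)

# each pattern becomes one value column: value1 (target major), value2 (count)
long <- melt(wide, id.vars = c("Major", "Age"),
             measure.vars = patterns("^MJR_", "^TRF_MJR_"))
long
```

Each wide row contributes one long row per column pair, and the `variable` column records which pair (1 or 2) the row came from. Rows with value1 equal to "0" are the placeholder entries that the answer filters out next.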
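Edit 1: By right joining with the first two columns of the original (wide) data, the result has the same number of rows and the same order as the input. NAs, which indicate missing data, are replaced by 0.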
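Caveat: As the OP pointed out, any value in the MJR columns that does not also occur in Major will not appear in the result, e.g., AGBU. So this approach is not recommended.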
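One way to find out in advance whether this caveat applies is to compare the values in the MJR columns with those in Major. The following is only a sketch on made-up data (`toy` and its values are hypothetical); it assumes "0" is the placeholder for "no transfer", as in the question.

```r
library(data.table)

# hypothetical toy data: "AGBU" occurs as a transfer target but not in Major
toy <- data.table(
  Major = c("MATH", "MLSP", "BIOL"),
  Age   = c(4, 4, 4),
  MJR_1 = c("AGBU", "0", "MATH"),
  MJR_2 = c("0", "0", "MLSP")
)

# collect all transfer targets from the MJR_ columns
mjr_cols <- grep("^MJR_", names(toy), value = TRUE)
targets  <- unique(unlist(toy[, mjr_cols, with = FALSE], use.names = FALSE))

# targets that a right join on Major would silently drop
dropped <- setdiff(targets, c("0", toy$Major))
dropped
```

If `dropped` is empty, the right join above loses nothing; otherwise the listed majors would be missing from the result.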
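The approach below uses melt() as before but follows Frank's approach in using CJ(). It enhances this by using factors to maintain the given order of the Major column, which also "prettifies" the result. Note that the handy forcats package is used.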
library(data.table)
library(forcats)
setDT(DF)[
# make sure factor levels are in order of occurrence
, Major := fct_inorder(Major)][
# reshape wide to long with multiple measures columns
, melt(.SD, measure.vars = patterns("^MJR", "^TRF"),
value.name = c("MJR", "TRF"))][
# omit null entries
MJR != "0"][
# unify factor levels with levels of Major in lead
, c("Major", "MJR") := fct_unify(.(Major, factor(MJR)))][
# use cross join to create all combinations of MJR and Age,
# right join with results
# (columns are named explicitly so the join does not rely on CJ()'s default V1, V2 names)
CJ(MJR = MJR, Age = Age, unique = TRUE), on = .(MJR, Age),
# aggregate by join parameters
.(TRF_IN_SUM = sum(TRF, na.rm = TRUE)), by = .EACHI]
     MJR Age TRF_IN_SUM
 1: MATH   2          5
 2: MATH   3          5
 3: MATH   4          0
 4: MLSP   2          3
 5: MLSP   3          1
 6: MLSP   4          2
 7: BIOL   2          9
 8: BIOL   3          2
 9: BIOL   4          2
10: PSHY   2          3
11: PSHY   3          8
12: PSHY   4          0
13: AGBU   2          0
14: AGBU   3          0
15: AGBU   4          2
Now the result includes AGBU, shows all combinations of MJR and Age, and retains the original order of Major.
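The ordering trick can be illustrated without forcats. The base R sketch below is only an approximation of what fct_inorder() and fct_unify() are used for here (levels in order of first appearance, then shared across both columns); the vectors `majors` and `targets` are hypothetical.

```r
# hypothetical values standing in for the Major and MJR columns
majors  <- c("MATH", "MATH", "MLSP", "BIOL", "PSHY", "BIOL")
targets <- c("BIOL", "AGBU", "MATH")

# levels in order of first appearance, similar in effect to fct_inorder()
f_major <- factor(majors, levels = unique(majors))

# shared level set with the Major levels first, similar in effect to fct_unify()
all_levels <- union(levels(f_major), targets)
f_major  <- factor(majors,  levels = all_levels)
f_target <- factor(targets, levels = all_levels)

# sorting a factor follows the level order, not the alphabet
sort(unique(f_target))
```

Because sorting and cross joins on a factor follow its level order, the majors come out in the order they first appear in Major, with target-only majors such as AGBU appended at the end.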
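However, this may still not be perfect if there are entries in Major that do not appear in any of the MJR columns. To cover this case, a full join, i.e., merge() with all = TRUE, is better suited.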
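A minimal sketch of such a full join follows, on made-up data in which "CHEM" occurs only in Major and "AGBU" only in an MJR column. The objects `toy`, `long`, `agg` and `res` are hypothetical names, and this is one possible way to write it rather than a tested solution for the real file.

```r
library(data.table)

# hypothetical toy data: "CHEM" only in Major, "AGBU" only as a transfer target
toy <- data.table(
  Major     = c("MATH", "CHEM", "BIOL"),
  Age       = c(2, 2, 2),
  MJR_1     = c("BIOL", "MATH", "AGBU"),
  TRF_MJR_1 = c(7, 3, 2),
  MJR_2     = c("0", "0", "MATH"),
  TRF_MJR_2 = c(0, 0, 1)
)

# reshape wide to long with two measure columns
long <- melt(toy, id.vars = c("Major", "Age"),
             measure.vars = patterns("^MJR_", "^TRF_MJR_"),
             value.name = c("MJR", "TRF"))

# transfers received per target major and age
agg <- long[MJR != "0", .(TRF_IN_SUM = sum(TRF)), by = .(Major = MJR, Age)]

# full join keeps majors that occur on either side
res <- merge(unique(toy[, .(Major, Age)]), agg,
             by = c("Major", "Age"), all = TRUE)

# groups without incoming transfers get 0 instead of NA
res[is.na(TRF_IN_SUM), TRF_IN_SUM := 0]
res
```

Note that merge() sorts by the join columns by default, so the original order of Major is not retained here; the factor approach above could be combined with this if the order matters.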