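I have written the code below, which sums the target column of an input data set and includes partial sums (or rollups, or whatever the preferred term is) over each of the other columns.
It works, but it relies on an undesirable nested for loop that I would like to replace with a more "functional" approach. I have had a go at this, but despite more than a little reading and practice, I remain in a state of non-grokkery when it comes to the various apply and/or dplyr functions.
Quite possibly I am doing everything wrong; for example, the setup that prepares for the loop may be unnecessary if the final solution does not need it. Basically, I just want the expected output to be produced from the provided input.
Anyway, here is the code: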
# dummy data -- assume this is given
#######################################################################
df1 <- c("AA","B","AA","B","AA","B","AA","B","AA","B","AA","B",
"M","M","N","N","M","M","N","N","M","M","N","N",
"X","X","X","X","Y","Y","Y","Y","Z","Z","Z","Z",
2,3,4,4,2,3,5,4,3,2,5,4)
dim(df1) <- c(12,4)
colnames(df1) <- c("f1","f2","f3","cnt")
df1 <- as.data.frame(df1,stringsAsFactors=F)
df1$cnt <- as.integer(df1$cnt)
#######################################################################
library(data.table)
# some hard-coded variables...
anyStr <- "(any)" # this string cannot appear in df1
targetColName <- "cnt" # name of the column being summed from df1
outputColName <- "sum" # name of our output column
# grab names of only the columns we're going after... (just do everything but the target)
colsToSummarize = (colnames(df1)[!colnames(df1) %in% list(targetColName)])
# create a data table of just the unique values for each of those columns...
df2 <- lapply(colsToSummarize, function(x) { unique(df1[,x])})
df2 <- as.data.table(df2)
# add a dummy row that basically means "any value" ...
# this string cannot otherwise be present in the data...
df2 <- rbind(df2,as.data.table(t(rep(anyStr,length(df2)))))
colnames(df2) <- c(colsToSummarize)
# expand df2 to generate all possible settings found in df1...
df2 <- unique(expand.grid(df2))
rownames(df2)<-NULL
# do all the sums... there's probably a clever way to do this using "apply" functions...
df2[,eval(outputColName)] <- 0
for (i2 in 1:nrow(df2)) {
  for (i1 in 1:nrow(df1)) {
    isMatch = T
    for (j in colsToSummarize) {
      if ((df2[i2,eval(j)]!=anyStr) & (df1[i1,eval(j)]!=df2[i2,eval(j)])) {
        isMatch = F
        break
      }
    }
    if (isMatch) {
      df2[i2,eval(outputColName)] = df2[i2,eval(outputColName)] + df1[i1,eval(targetColName)]
    }
  }
}
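For what it's worth, the two outer loops can be collapsed into one vectorized match per output row. The following is only a sketch in base R, not a definitive rewrite: it rebuilds the dummy data more compactly, and the names keyCols, grid and sumFor are invented for the illustration. A short loop over the key columns remains, but it compares whole columns of df1 at once.

```r
# Same dummy data as above, built directly as a data frame.
df1 <- data.frame(
  f1  = rep(c("AA", "B"), times = 6),
  f2  = rep(rep(c("M", "N"), each = 2), times = 3),
  f3  = rep(c("X", "Y", "Z"), each = 4),
  cnt = c(2L, 3L, 4L, 4L, 2L, 3L, 5L, 4L, 3L, 2L, 5L, 4L),
  stringsAsFactors = FALSE
)

anyStr  <- "(any)"                      # wildcard; must not occur in df1
keyCols <- setdiff(names(df1), "cnt")   # every column except the target

# All combinations of the observed values plus the wildcard, per column.
grid <- expand.grid(
  lapply(df1[keyCols], function(x) c(unique(x), anyStr)),
  stringsAsFactors = FALSE
)

# For grid row i, sum cnt over the df1 rows that agree with it
# on every column that is not the wildcard.
sumFor <- function(i) {
  keep <- rep(TRUE, nrow(df1))
  for (col in keyCols) {
    val <- grid[i, col]
    if (val != anyStr) keep <- keep & (df1[[col]] == val)
  }
  sum(df1$cnt[keep])
}

grid$sum <- vapply(seq_len(nrow(grid)), sumFor, numeric(1))
```

Because expand.grid() varies the first column fastest, the rows of grid come out in the same order as the expected output shown below.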
So the sample dummy data looks like this:
> df1
f1 f2 f3 cnt
1 AA M X 2
2 B M X 3
3 AA N X 4
4 B N X 4
5 AA M Y 2
6 B M Y 3
7 AA N Y 5
8 B N Y 4
9 AA M Z 3
10 B M Z 2
11 AA N Z 5
12 B N Z 4
...and the expected output:
> df2
f1 f2 f3 sum
1 AA M X 2
2 B M X 3
3 (any) M X 5
4 AA N X 4
5 B N X 4
6 (any) N X 8
7 AA (any) X 6
8 B (any) X 7
9 (any) (any) X 13
10 AA M Y 2
11 B M Y 3
12 (any) M Y 5
13 AA N Y 5
14 B N Y 4
15 (any) N Y 9
16 AA (any) Y 7
17 B (any) Y 7
18 (any) (any) Y 14
19 AA M Z 3
20 B M Z 2
21 (any) M Z 5
22 AA N Z 5
23 B N Z 4
24 (any) N Z 9
25 AA (any) Z 8
26 B (any) Z 6
27 (any) (any) Z 14
28 AA M (any) 7
29 B M (any) 8
30 (any) M (any) 15
31 AA N (any) 14
32 B N (any) 12
33 (any) N (any) 26
34 AA (any) (any) 21
35 B (any) (any) 20
36 (any) (any) (any) 41
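Of course, output that is essentially the same is fine too (for example, NA or a blank instead of "(any)"; the order of rows and columns does not matter, and so on).
As an aside: this is not quite the same as SQL's GROUP BY ... WITH ROLLUP, because it produces every permutation rather than the subset determined by the order of the variables in the GROUP BY clause. Anyone reading this who wants only that subset just needs to drop the rows that contain an unexpected "(any)" value.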
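Since the code above already loads data.table, it may be worth noting that the package appears to cover both cases: cube() aggregates over every combination of the grouping columns, and rollup() gives the hierarchical, SQL-style subset. The sketch below assumes a data.table version that ships these two functions; the names dt, keyCols, allCombos and hier are made up here. Columns that were aggregated over come back as NA, which the question says is acceptable, and the last step swaps NA for "(any)" only for display.

```r
library(data.table)

# Same dummy data as in the question, built directly as a data.table.
dt <- data.table(
  f1  = rep(c("AA", "B"), times = 6),
  f2  = rep(rep(c("M", "N"), each = 2), times = 3),
  f3  = rep(c("X", "Y", "Z"), each = 4),
  cnt = c(2L, 3L, 4L, 4L, 2L, 3L, 5L, 4L, 3L, 2L, 5L, 4L)
)
keyCols <- c("f1", "f2", "f3")

# Every combination of grouping columns (the 36-row result asked for).
allCombos <- cube(dt, j = list(sum = sum(cnt)), by = keyCols)

# Only the hierarchical subset: (f1, f2, f3), (f1, f2), (f1), grand total.
hier <- rollup(dt, j = list(sum = sum(cnt)), by = keyCols)

# Optional: label the aggregated-over columns with "(any)" instead of NA.
for (col in keyCols) {
  set(allCombos, i = which(is.na(allCombos[[col]])), j = col, value = "(any)")
}
```

Row order differs from the expected output, which the question says does not matter. If the real data can contain genuine NA values in the key columns, they would be indistinguishable from the aggregated rows after this step, so that case needs separate handling.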
Answer 0 (score: 2)
You can combine addmargins() with ftable(). First, build a table that sums cnt for each group:
table1 <- xtabs(cnt ~f1 + f2 + f3, data= df1)
> table1
, , f3 = X

    f2
f1   M N
  AA 2 4
  B  3 4

, , f3 = Y

    f2
f1   M N
  AA 2 5
  B  3 4

, , f3 = Z

    f2
f1   M N
  AA 3 5
  B  2 4
Then use addmargins() to compute the partial sums:
tablle2 <- addmargins(table1)
> tablle2
, , f3 = X

     f2
f1     M  N Sum
  AA   2  4   6
  B    3  4   7
  Sum  5  8  13

, , f3 = Y

     f2
f1     M  N Sum
  AA   2  5   7
  B    3  4   7
  Sum  5  9  14

, , f3 = Z

     f2
f1     M  N Sum
  AA   3  5   8
  B    2  4   6
  Sum  5  9  14

, , f3 = Sum

     f2
f1     M  N Sum
  AA   7 14  21
  B    8 12  20
  Sum 15 26  41
Finally, ftable() puts it into a compact form:
table3 <- ftable(tablle2)
> table3
        f3  X  Y  Z Sum
f1  f2
AA  M       2  2  3   7
    N       4  5  5  14
    Sum     6  7  8  21
B   M       3  3  2   8
    N       4  4  4  12
    Sum     7  7  6  20
Sum M       5  5  5  15
    N       8  9  9  26
    Sum    13 14 14  41
A final call to as.data.frame() then gives the shape asked for in the question:
table4 <- as.data.frame(table3)
> table4
f1 f2 f3 Freq
1 AA M X 2
2 B M X 3
3 Sum M X 5
4 AA N X 4
5 B N X 4
6 Sum N X 8
7 AA Sum X 6
8 B Sum X 7
9 Sum Sum X 13
10 AA M Y 2
11 B M Y 3
12 Sum M Y 5
13 AA N Y 5
14 B N Y 4
15 Sum N Y 9
16 AA Sum Y 7
17 B Sum Y 7
18 Sum Sum Y 14
19 AA M Z 3
20 B M Z 2
21 Sum M Z 5
22 AA N Z 5
23 B N Z 4
24 Sum N Z 9
25 AA Sum Z 8
26 B Sum Z 6
27 Sum Sum Z 14
28 AA M Sum 7
29 B M Sum 8
30 Sum M Sum 15
31 AA N Sum 14
32 B N Sum 12
33 Sum N Sum 26
34 AA Sum Sum 21
35 B Sum Sum 20
36 Sum Sum Sum 41
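To match the question's expected output exactly, the "Sum" labels and the Freq column can be renamed. This is a sketch rather than part of the answer above: it skips the ftable() step and converts the addmargins() result directly, and the names res and keyCols are illustrative. It also assumes that "Sum" never occurs as a real value in f1, f2 or f3, since such a value could not be told apart from the margin label.

```r
# Same dummy data as in the question, built directly as a data frame.
df1 <- data.frame(
  f1  = rep(c("AA", "B"), times = 6),
  f2  = rep(rep(c("M", "N"), each = 2), times = 3),
  f3  = rep(c("X", "Y", "Z"), each = 4),
  cnt = c(2L, 3L, 4L, 4L, 2L, 3L, 5L, 4L, 3L, 2L, 5L, 4L),
  stringsAsFactors = FALSE
)
keyCols <- c("f1", "f2", "f3")

# Cross-tabulate, add all margins, and flatten to long form.
# responseName sets the name of the value column (default "Freq").
res <- as.data.frame(
  addmargins(xtabs(cnt ~ f1 + f2 + f3, data = df1)),
  responseName = "sum",
  stringsAsFactors = FALSE
)

# Replace the margin label with the wildcard used in the question.
res[keyCols] <- lapply(res[keyCols], function(x) replace(x, x == "Sum", "(any)"))
```

With stringsAsFactors = FALSE the key columns are plain character vectors, so the replacement does not have to deal with factor levels.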