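I am trying to find a lazy/easy way to create summary tables/data.frames from wide data.frames. Suppose I have the following data.frame, but with many more columns, so that specifying the column names individually would take a long time: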
set.seed(2)
x <- data.frame(Rep = rep(1:3, 4), Temp = c(rep(10,6), rep(20,6)),
pH = rep(c(rep(8.1, 3), rep(7.6, 3)), 2),
Var1 = rnorm(12, 5,2), Var2 = c(rnorm(6,4,1), rnorm(6,3,5)),
Var3 = rt(12, 20))
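# Convert the grouping columns to factors. lapply() is used instead of apply(),
# which returns a character matrix that as.data.frame() no longer turns into
# factors by default since R 4.0.0.
x[1:3] <- lapply(x[1:3], as.factor)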
Now, I can calculate summary statistics using plyr:
library(plyr)
(mu <- ddply(x, .(Temp, pH), numcolwise(mean)))
(std <- ddply(x, .(Temp, pH), numcolwise(sd)))
(n <- ddply(x, .(Temp, pH), numcolwise(length)))
But I can't figure out how to apply all of these functions at once:
ddply(x, .(Temp, pH), numcolwise(mean, sd, length))
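I could of course merge the individual summary tables, but that is hardly a "lazy/easy" way. I am looking for something general that I can apply in many situations. Something like the following, except that it should be possible to produce it with a single function: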
xx <- merge(mu, std, by = c("Temp", "pH"), sort = FALSE)
colnames(xx) <- gsub("x", "mean", colnames(xx))
colnames(xx) <- gsub("y", "sd", colnames(xx))
xx <- merge(xx, n, by = c("Temp", "pH"), sort = FALSE)
colnames(xx)[(ncol(xx)-2):ncol(xx)] <-
paste0(colnames(xx)[(ncol(xx)-2):ncol(xx)], ".length")
xx <- xx[c("Temp", "pH", grep("Var1", colnames(xx), value = T),
grep("Var2", colnames(xx), value = T),
grep("Var3", colnames(xx), value = T))]
xx
Temp pH Var1.mean Var1.sd Var1.length Var2.mean Var2.sd Var2.length Var3.mean Var3.sd Var3.length
1 10 7.6 4.281195 1.352194 3 3.534447 1.652884 3 0.1529616 1.076276 3
2 10 8.1 5.583853 2.491672 3 4.116622 1.478286 3 1.1611944 1.081301 3
3 20 7.6 5.840411 1.120549 3 6.907273 8.628021 3 0.1301949 1.764201 3
4 20 8.1 6.635154 2.232262 3 8.893188 4.208087 3 0.5509202 1.187431 3
Is this currently possible in R? Any suggestions would be greatly appreciated.
Answer 0 (score: 5)
Here is one way to do this using reshape2 and plyr. Note, however, that you get the variables as rows rather than columns:
library(reshape2)
library(plyr)
md <- melt(x[,-1], id.vars=c("Temp","pH"))
ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))
This gives:
Temp pH variable mean sd
1 10 7.6 Var1 4.2811952 1.352194
2 10 7.6 Var2 3.5344474 1.652884
3 10 7.6 Var3 0.1529616 1.076276
4 10 8.1 Var1 5.5838533 2.491672
5 10 8.1 Var2 4.1166215 1.478286
6 10 8.1 Var3 1.1611944 1.081301
7 20 7.6 Var1 5.8404110 1.120549
8 20 7.6 Var2 6.9072734 8.628021
9 20 7.6 Var3 0.1301949 1.764201
10 20 8.1 Var1 6.6351538 2.232262
11 20 8.1 Var2 8.8931884 4.208087
12 20 8.1 Var3 0.5509202 1.187431
If you want the result in wide format, you can use reshape:
md <- melt(x[,-1], id.vars=c("Temp","pH"))
result <- ddply(md, c("Temp", "pH", "variable"), summarize, mean=mean(value), sd=sd(value))
reshape(result, idvar=c("Temp","pH"), timevar="variable",direction="wide")
Temp pH mean.Var1 sd.Var1 mean.Var2 sd.Var2 mean.Var3 sd.Var3
1 10 7.6 4.281195 1.352194 3.534447 1.652884 0.1529616 1.076276
4 10 8.1 5.583853 2.491672 4.116622 1.478286 1.1611944 1.081301
7 20 7.6 5.840411 1.120549 6.907273 8.628021 0.1301949 1.764201
10 20 8.1 6.635154 2.232262 8.893188 4.208087 0.5509202 1.187431
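For readers unfamiliar with reshape, here is a minimal sketch of the same long-to-wide step on a small hand-made table. The object names (long, wide) and all values are made up for illustration and are not taken from the data above. idvar names the columns that identify each output row, and timevar names the column whose values become suffixes of the new column names.

```r
# Toy long-format summary: one row per (group, variable) pair.
# All names and values here are illustrative assumptions.
long <- data.frame(
  group    = c("a", "a", "b", "b"),
  variable = c("Var1", "Var2", "Var1", "Var2"),
  mean     = c(1, 2, 3, 4),
  sd       = c(0.1, 0.2, 0.3, 0.4)
)

# Every column that is neither idvar nor timevar is spread out,
# giving one column per (statistic, variable) combination.
wide <- reshape(long, idvar = "group", timevar = "variable",
                direction = "wide")

names(wide)
```

The resulting names follow the statistic.variable pattern seen in the output above, which is the reverse of the variable.statistic order asked for in the question.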
Answer 1 (score: 2)
Base R's aggregate can actually handle this, but in a weird way:
(temp <- aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))
# Temp pH Rep.1 Rep.2 Rep.3 Var1.1 Var1.2 Var1.3 Var2.1 Var2.2 Var2.3
# 1 10 7.6 2 1 3 4.281195 1.352194 3.000000 3.534447 1.652884 3.000000
# 2 20 7.6 2 1 3 5.840411 1.120549 3.000000 6.907273 8.628021 3.000000
# 3 10 8.1 2 1 3 5.583853 2.491672 3.000000 4.116622 1.478286 3.000000
# 4 20 8.1 2 1 3 6.635154 2.232262 3.000000 8.893188 4.208087 3.000000
# Var3.1 Var3.2 Var3.3
# 1 0.1529616 1.0762763 3.0000000
# 2 0.1301949 1.7642008 3.0000000
# 3 1.1611944 1.0813007 3.0000000
# 4 0.5509202 1.1874306 3.0000000
str(temp)
# 'data.frame': 4 obs. of 6 variables:
# $ Temp: Factor w/ 2 levels "10","20": 1 2 1 2
# $ pH : Factor w/ 2 levels "7.6","8.1": 1 1 2 2
# $ Rep : num [1:4, 1:3] 2 2 2 2 1 1 1 1 3 3 ...
# $ Var1: num [1:4, 1:3] 4.28 5.84 5.58 6.64 1.35 ...
# $ Var2: num [1:4, 1:3] 3.53 6.91 4.12 8.89 1.65 ...
# $ Var3: num [1:4, 1:3] 0.153 0.13 1.161 0.551 1.076 ...
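Note that when we look at the structure of the output, we find that "Rep", "Var1", and so on are actually matrices. So you could extract them and cbind them together. That is a bit tedious, though.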
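To make the matrix-column behaviour concrete, here is a small sketch on toy data. The names d, g, v1, v2, agg and flat are assumptions chosen for illustration, and FUN uses c() rather than cbind() to return the two statistics.

```r
# Toy data; all names and values are illustrative assumptions.
d <- data.frame(
  g  = c("a", "a", "b", "b"),
  v1 = c(1, 3, 5, 7),
  v2 = c(2, 2, 4, 8)
)

# FUN returns two values, so each summarised variable becomes a matrix column.
agg <- aggregate(. ~ g, d, function(y) c(mean(y), sd(y)))

sapply(agg, is.matrix)  # flags which columns are matrices
dim(agg$v1)             # rows are groups, columns are statistics

# Pull the matrices out and bind them back as ordinary columns.
flat <- cbind(agg["g"], v1 = agg$v1, v2 = agg$v2)
str(flat)
```

With many variables this manual extraction has to be repeated for every matrix column, which is the tedium the wrapper is meant to avoid.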
I had to do something similar, and I ended up just writing a basic wrapper around aggregate that looks like this.
aggregate2 <- function(data, aggs, ids, funs = NULL, ...) {
  # "." means "every column not named in the other argument"
  if (identical(aggs, "."))
    aggs <- setdiff(names(data), ids)
  if (identical(ids, "."))
    ids <- setdiff(names(data), aggs)
  if (is.null(funs))
    stop("Aggregation function missing")
  # Build a formula of the form cbind(a, b, ...) ~ id1 + id2
  myformula <- as.formula(
    paste(sprintf("cbind(%s)", paste(aggs, collapse = ", ")),
          " ~ ", paste(ids, collapse = " + ")))
  # The formula is passed positionally because the first argument of
  # aggregate's formula method was renamed from `formula` to `x` in R 4.2.0.
  temp <- aggregate(
    myformula, data = data,
    FUN = function(x) sapply(seq_along(funs),
      function(z) eval(call(funs[z], quote(x)))), ...)
  # Each aggregated variable comes back as a matrix column: flatten and rename
  temp1 <- do.call(cbind, lapply(temp[-c(1:length(ids))], as.data.frame))
  names(temp1) <- paste(rep(aggs, each = length(funs)), funs, sep = ".")
  cbind(temp[1:length(ids)], temp1)
}
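The key trick in the wrapper is calling functions whose names are stored as strings. The sketch below isolates that step on a plain vector. The names y, by_call and by_match are assumptions for illustration, and the match.fun variant is an alternative formulation, not part of the wrapper above.

```r
funs <- c("mean", "sd", "length")
y <- c(1, 3, 5)

# Same idea as in the wrapper: build the call mean(y), sd(y), ... from the
# function name and evaluate it.
by_call <- sapply(seq_along(funs),
                  function(z) eval(call(funs[z], quote(y))))

# Alternative sketch: look the function up by name and call it directly.
# The result is additionally named after the functions.
by_match <- sapply(funs, function(f) match.fun(f)(y))

by_call
by_match
```

Both forms return one value per function, which is why each aggregated variable ends up with length(funs) columns.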
Here is how to apply it to the sample data.
(temp2 <- aggregate2(x, ".", c("Temp", "pH"), c("mean", "sd", "length")))
# Temp pH Rep.mean Rep.sd Rep.length Var1.mean Var1.sd Var1.length Var2.mean
# 1 10 7.6 2 1 3 4.281195 1.352194 3 3.534447
# 2 20 7.6 2 1 3 5.840411 1.120549 3 6.907273
# 3 10 8.1 2 1 3 5.583853 2.491672 3 4.116622
# 4 20 8.1 2 1 3 6.635154 2.232262 3 8.893188
# Var2.sd Var2.length Var3.mean Var3.sd Var3.length
# 1 1.652884 3 0.1529616 1.076276 3
# 2 8.628021 3 0.1301949 1.764201 3
# 3 1.478286 3 1.1611944 1.081301 3
# 4 4.208087 3 0.5509202 1.187431 3
And the structure is what we would expect.
str(temp2)
# 'data.frame': 4 obs. of 14 variables:
# $ Temp : Factor w/ 2 levels "10","20": 1 2 1 2
# $ pH : Factor w/ 2 levels "7.6","8.1": 1 1 2 2
# $ Rep.mean : num 2 2 2 2
# $ Rep.sd : num 1 1 1 1
# $ Rep.length : num 3 3 3 3
# $ Var1.mean : num 4.28 5.84 5.58 6.64
# $ Var1.sd : num 1.35 1.12 2.49 2.23
# $ Var1.length: num 3 3 3 3
# $ Var2.mean : num 3.53 6.91 4.12 8.89
# $ Var2.sd : num 1.65 8.63 1.48 4.21
# $ Var2.length: num 3 3 3 3
# $ Var3.mean : num 0.153 0.13 1.161 0.551
# $ Var3.sd : num 1.08 1.76 1.08 1.19
# $ Var3.length: num 3 3 3 3
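If you don't want to use the function, this is the part that specifically handles the output of aggregate, applied to the "temp" object we created at the start of this answer: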
temp1 <- do.call(cbind, lapply(temp[-c(1:2)], as.data.frame))
funs <- c("mean", "sd", "length")
names(temp1) <- paste(rep(setdiff(names(temp), c("pH", "Temp")),
each = length(funs)), funs, sep = ".")
cbind(temp[1:2], temp1)
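Alternatively, wrapping the aggregate call in do.call(data.frame, ...) flattens the matrix columns in one step:
do.call(data.frame,
aggregate(. ~ Temp + pH, x, function(y) cbind(mean(y), sd(y), length(y))))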
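The downside here is that the names are not as descriptive as with the aggregate2 function I shared, but that can be fixed with a very simple call to names.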
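One possible way to do that renaming is sketched here on toy data. The names d, g, v1, v2, vars and flat are assumptions for illustration, and FUN uses c() to return the statistics.

```r
# Toy data; all names and values are illustrative assumptions.
d <- data.frame(
  g  = c("a", "a", "b", "b"),
  v1 = c(1, 3, 5, 7),
  v2 = c(2, 2, 4, 8)
)
funs <- c("mean", "sd", "length")

# Flatten the matrix columns produced by aggregate in one step.
flat <- do.call(data.frame,
                aggregate(. ~ g, d, function(y) c(mean(y), sd(y), length(y))))

# Rebuild the names: every summarised variable contributes one column per
# statistic, in the order the statistics are returned by FUN.
vars <- setdiff(names(d), "g")
names(flat)[-1] <- paste(rep(vars, each = length(funs)), funs, sep = ".")
names(flat)
```

This relies on the order of funs matching the order of the values returned by FUN, so the two need to be kept in sync by hand.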