我有一个如下所示的数据框:
set.seed(50)
data.frame(distance=c(rep("long", 5), rep("short", 5)),
year=rep(2002:2006),
mean.length=rnorm(10))
distance year mean.length
1 long 2002 0.54966989
2 long 2003 -0.84160374
3 long 2004 0.03299794
4 long 2005 0.52414971
5 long 2006 -1.72760411
6 short 2002 -0.27786453
7 short 2003 0.36082844
8 short 2004 -0.59091244
9 short 2005 0.97559055
10 short 2006 -1.44574995
我需要计算每年mean.length
和long
之间short
之间的差异。什么是最快的方式呢?
答案 0 :(得分:5)
这是使用plyr的一种方式:
set.seed(50)
df <- data.frame(distance=c(rep("long", 5),rep("short", 5)),
year=rep(2002:2006),
mean.length=rnorm(10))
library(plyr)
aggregation.fn <- function(df) {
data.frame(year=df$year[1],
diff=(df$mean.length[df$distance == "long"] -
df$mean.length[df$distance == "short"]))}
new.df <- ddply(df, "year", aggregation.fn)
给你
> new.df
year diff
1 2002 0.8275344
2 2003 -1.2024322
3 2004 0.6239104
4 2005 -0.4514408
5 2006 -0.2818542
第二种方式
df <- df[order(df$year, df$distance), ]
n <- dim(df)[1]
df$new.year <- c(1, df$year[2:n] != df$year[1:(n-1)])
df$diff <- c(-diff(df$mean.length), NA)
df$diff[!df$new.year] <- NA
new.df.2 <- df[!is.na(df$diff), c("year", "diff")]
all(new.df.2 == new.df) # True
答案 1 :(得分:3)
像这样使用tapply()
和apply()
:
apply(
with(x, tapply(mean.length, list(year, distance), FUN=mean)),
1,
diff
)
2002 2003 2004 2005 2006
-0.8275344 1.2024322 -0.6239104 0.4514408 0.2818542
这是有效的,因为tapply
按year
和distance
创建表格摘要:
with(x, tapply(mean.length, list(year, distance), FUN=mean))
long short
2002 0.54966989 -0.2778645
2003 -0.84160374 0.3608284
2004 0.03299794 -0.5909124
2005 0.52414971 0.9755906
2006 -1.72760411 -1.4457499
答案 2 :(得分:2)
由于您似乎已配对值且data.frame已订购,您可以执行以下操作:
res <- with(DF, mean.length[distance=="long"]-mean.length[distance=="short"])
names(res) <- unique(DF$year)
# 2002 2003 2004 2005 2006
#0.8275344 -1.2024322 0.6239104 -0.4514408 -0.2818542
这应该非常快,但不像其他答案那样安全,因为它依赖于假设。
答案 3 :(得分:1)
您已经收到了一些很好的答案来计算手头的具体问题。您可以考虑将数据重新整理为宽格式。这有两个选择:
reshape(df, direction = "wide", idvar = "year", timevar = "distance")
#---
year mean.length.long mean.length.short
1 2002 0.54966989 -0.2778645
2 2003 -0.84160374 0.3608284
3 2004 0.03299794 -0.5909124
4 2005 0.52414971 0.9755906
5 2006 -1.72760411 -1.4457499
#package reshape2 is probably easier to use.
library(reshape2)
dcast(year ~ distance, data = df)
#---
year long short
1 2002 0.54966989 -0.2778645
2 2003 -0.84160374 0.3608284
3 2004 0.03299794 -0.5909124
4 2005 0.52414971 0.9755906
5 2006 -1.72760411 -1.4457499
您现在可以轻松计算新统计信息。