Question

I have a data frame like the following:

set.seed(50)
data.frame(distance=c(rep("long", 5), rep("short", 5)),
           year=rep(2002:2006),
           mean.length=rnorm(10))

   distance year mean.length
1      long 2002  0.54966989
2      long 2003 -0.84160374
3      long 2004  0.03299794
4      long 2005  0.52414971
5      long 2006 -1.72760411
6     short 2002 -0.27786453
7     short 2003  0.36082844
8     short 2004 -0.59091244
9     short 2005  0.97559055
10    short 2006 -1.44574995

For each year, I need to calculate the difference in mean.length between long and short. What is the fastest way to do this?

Answer 1

Here is one way, using plyr:

set.seed(50)
df <- data.frame(distance=c(rep("long", 5),rep("short", 5)),
                 year=rep(2002:2006),
                 mean.length=rnorm(10))

library(plyr)
aggregation.fn <- function(df) {
  data.frame(year=df$year[1],
             diff=(df$mean.length[df$distance == "long"] -
                   df$mean.length[df$distance == "short"]))}
new.df <- ddply(df, "year", aggregation.fn)

which gives

> new.df
  year       diff
1 2002  0.8275344
2 2003 -1.2024322
3 2004  0.6239104
4 2005 -0.4514408
5 2006 -0.2818542

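A second way, using only base R:

# Sort so that, within each year, "long" sits directly above "short".
df <- df[order(df$year, df$distance), ]
n <- nrow(df)
# Flag the first row of each year.
df$new.year <- c(TRUE, df$year[2:n] != df$year[1:(n-1)])
# -diff() is each row minus the next one: long minus short on flagged rows.
df$diff <- c(-diff(df$mean.length), NA)
df$diff[!df$new.year] <- NA
new.df.2 <- df[!is.na(df$diff), c("year", "diff")]

all(new.df.2 == new.df)  # TRUE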

Answer 2

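Use tapply() and apply() like this, where x is the data frame from the question. Note that diff() returns short minus long here, so the signs are the opposite of those in the other answers: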

apply(
  with(x, tapply(mean.length, list(year, distance), FUN=mean)),
  1, 
  diff
)

      2002       2003       2004       2005       2006 
-0.8275344  1.2024322 -0.6239104  0.4514408  0.2818542

This works because tapply() creates a tabular summary by year and distance:

with(x, tapply(mean.length, list(year, distance), FUN=mean))

            long      short
2002  0.54966989 -0.2778645
2003 -0.84160374  0.3608284
2004  0.03299794 -0.5909124
2005  0.52414971  0.9755906
2006 -1.72760411 -1.4457499
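
To get long minus short directly, the columns of that table can be subtracted by name. This is a minimal sketch, assuming the question's data frame has been stored under the name x (the question itself does not assign it to a variable):

```r
# Assumption: x holds the data frame from the question.
set.seed(50)
x <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                year = rep(2002:2006, times = 2),
                mean.length = rnorm(10))

# Year-by-distance table of means (each cell holds a single value here).
tab <- with(x, tapply(mean.length, list(year, distance), FUN = mean))

# Subtract by column name so the sign does not depend on column order.
long.minus.short <- tab[, "long"] - tab[, "short"]
long.minus.short
```

The result is a numeric vector named by year, like the apply() output above but with the sign flipped.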

Answer 3

Since your values appear to be paired and the data.frame is already sorted, you can do the following:

res <- with(DF, mean.length[distance=="long"]-mean.length[distance=="short"])
names(res) <- unique(DF$year)

#     2002       2003       2004       2005       2006 
#0.8275344 -1.2024322  0.6239104 -0.4514408 -0.2818542

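This should be very fast, but it is not as safe as the other answers because it relies on those assumptions: every year has exactly one long and one short value, and the years appear in the same order within both groups.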
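
If the row order cannot be trusted, one option is to pair the rows by year with merge() instead of relying on position. This is a sketch only, assuming the data frame is named DF as in the snippet above; the helper names long, short and paired are introduced here for illustration:

```r
# Assumption: DF holds the data frame from the question.
set.seed(50)
DF <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                 year = rep(2002:2006, times = 2),
                 mean.length = rnorm(10))

long  <- DF[DF$distance == "long",  c("year", "mean.length")]
short <- DF[DF$distance == "short", c("year", "mean.length")]

# merge() matches rows on year, so the result does not depend on row order.
paired <- merge(long, short, by = "year", suffixes = c(".long", ".short"))
res <- setNames(paired$mean.length.long - paired$mean.length.short, paired$year)
res
```

This is slower than the direct subtraction, but a year missing from either group is dropped from the result rather than silently misaligning the remaining values.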

Answer 4

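You have already received some good answers for the specific problem at hand. You might also consider reshaping your data into wide format. Here are two options: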

reshape(df, direction = "wide", idvar = "year", timevar = "distance")
#---
  year mean.length.long mean.length.short
1 2002       0.54966989        -0.2778645
2 2003      -0.84160374         0.3608284
3 2004       0.03299794        -0.5909124
4 2005       0.52414971         0.9755906
5 2006      -1.72760411        -1.4457499

#package reshape2 is probably easier to use.
library(reshape2)
# Naming value.var avoids relying on dcast() guessing the value column.
dcast(df, year ~ distance, value.var = "mean.length")
#---
  year        long      short
1 2002  0.54966989 -0.2778645
2 2003 -0.84160374  0.3608284
3 2004  0.03299794 -0.5909124
4 2005  0.52414971  0.9755906
5 2006 -1.72760411 -1.4457499

You can now easily compute new statistics.
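
For instance, the per-year difference becomes a plain column subtraction on the wide data. This is a minimal sketch using the base reshape() call from above; the names wide and diff are chosen here for illustration:

```r
set.seed(50)
df <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                 year = rep(2002:2006, times = 2),
                 mean.length = rnorm(10))

# One row per year, one column per distance category.
wide <- reshape(df, direction = "wide", idvar = "year", timevar = "distance")

# New column: long minus short for each year.
wide$diff <- wide$mean.length.long - wide$mean.length.short
wide[, c("year", "diff")]
```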

Calculating differences in a data frame
