How can I speed up summarising and aggregating data?

Asked: 2011-10-11 07:09:03

Tags: r plyr data.table

I have a dataset with the following headers:

PID Time Site Rep Count

I want to sum Count by Rep for each PID x Time x Site combination.

On the resulting data.frame, I then want the mean of Count for each PID x Time x Site combination.

My current function is as follows:

dummy <- function (data)
{
A<-aggregate(Count~PID+Time+Site+Rep,data=data,function(x){sum(na.omit(x))})
B<-aggregate(Count~PID+Time+Site,data=A,mean)
return (B)
}

This is very slow (the original data.frame is 510000 x 20). Is there a way to speed this up with plyr?
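For concreteness, here is a minimal sketch of the two-step aggregation on a tiny made-up data frame. The name `toy` and its values are illustrative assumptions, not the real data.

```r
# Toy data: one PID x Time x Site combination with two reps (made-up values).
toy <- data.frame(
  PID   = c("a", "a", "a", "a"),
  Time  = c(1, 1, 1, 1),
  Site  = c("s", "s", "s", "s"),
  Rep   = c(1, 1, 2, 2),
  Count = c(1, 2, 3, 4)
)

# Step 1: total Count within each rep (Rep 1 -> 3, Rep 2 -> 7).
per_rep <- aggregate(Count ~ PID + Time + Site + Rep, data = toy, FUN = sum)

# Step 2: mean of the per-rep totals for each PID x Time x Site combination.
per_combo <- aggregate(Count ~ PID + Time + Site, data = per_rep, FUN = mean)

print(per_combo)  # a single row with Count = 5
```

So the desired output has one row per PID x Time x Site combination, holding the average of the per-rep totals.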

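2 answers:

Answer 0: (score: 21)

You should look at the data.table package for faster aggregation operations on large data frames. For your problem, the solution would look like this: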

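library(data.table)
# data_tab is assumed to be the data.frame from the question
data_t = data.table(data_tab)
# column names are case-sensitive: the column is Count, not count
ans = data_t[,list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']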
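The call above aggregates once per PID x Time x Site. If the two-step reading of the question is intended (sum within each Rep, then average those sums), a chained data.table version might look like the sketch below. The toy data and the names `toy`, `per_rep`, `total` and `meanCount` are assumptions made for illustration.

```r
library(data.table)

# Toy data (made-up values): two PIDs, each with two reps.
toy <- data.table(
  PID   = c("a", "a", "a", "a", "b", "b"),
  Time  = c(1, 1, 1, 1, 1, 1),
  Site  = c("s", "s", "s", "s", "s", "s"),
  Rep   = c(1, 1, 2, 2, 1, 2),
  Count = c(1, 2, 3, 4, 10, 20)
)

# Step 1: total Count per PID x Time x Site x Rep, ignoring NAs.
per_rep <- toy[, .(total = sum(Count, na.rm = TRUE)),
               by = .(PID, Time, Site, Rep)]

# Step 2: mean of the per-rep totals per PID x Time x Site.
ans <- per_rep[, .(meanCount = mean(total)), by = .(PID, Time, Site)]

print(ans)  # meanCount is 5 for PID "a" and 15 for PID "b"
```

Groups come back in order of first appearance, so the result rows follow the order of the input data.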

Answer 1: (score: 7)

Let's see how fast data.table is compared with dplyr. This is roughly how you would do it in dplyr.

data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(totalCount = sum(Count)) %>%
    group_by(PID, Time, Site) %>% 
    summarise(mean(totalCount))

Or perhaps this, depending on the exact interpretation of the question:

    data %>% group_by(PID, Time, Site) %>%
        summarise(totalCount = sum(Count), meanCount = mean(Count))

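Here is a complete example comparing these alternatives with the answer proposed by @Ramnath and the one proposed by @David Arenburg in the comments, which I think is equivalent to the second dplyr statement.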

nrow <- 510000
data <- data.frame(PID = sample(letters, nrow, replace = TRUE), 
                   Time = sample(letters, nrow, replace = TRUE),
                   Site = sample(letters, nrow, replace = TRUE),
                   Rep = rnorm(nrow),
                   Count = rpois(nrow, 100))


library(dplyr)
library(data.table)

Rprof(tf1 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(totalCount = sum(Count)) %>%
    group_by(PID, Time, Site) %>% 
    summarise(mean(totalCount))
Rprof()
summaryRprof(tf1)  #reports 1.68 sec sampling time

Rprof(tf2 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
    summarise(total = sum(Count), meanCount = mean(Count)) 
Rprof()
summaryRprof(tf2)  # reports 1.60 seconds

Rprof(tf3 <- tempfile())
data_t = data.table(data)
ans = data_t[,list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf3)  #reports 0.06 seconds

Rprof(tf4 <- tempfile())
ans <- setDT(data)[,.(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf4)  #reports 0.02 seconds

The data.table approach is faster, and setDT is faster still!
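A plausible reason for the gap between the last two timings is that data.table() builds a new copy of the data, while setDT() converts the existing object by reference. A minimal sketch of that difference, using made-up objects `df1` and `df2`:

```r
library(data.table)

# data.table() returns a new object; the original data.frame is unchanged.
df1 <- data.frame(x = 1:3)
dt_copy <- data.table(df1)
is.data.table(df1)      # FALSE
is.data.table(dt_copy)  # TRUE

# setDT() converts in place, without copying the columns.
df2 <- data.frame(x = 1:3)
setDT(df2)
is.data.table(df2)      # TRUE
```

One consequence is that after the setDT call in the benchmark, `data` itself is a data.table for the rest of the session.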