ddply timing and performance problem

Time: 2016-02-24 03:51:07

Tags: r performance plyr

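I could use some help diagnosing the source of a timing problem with ddply. The ddply call takes 10 minutes to run on a small dataset (a data frame of about 4 MB).

I am trying to run ddply as follows: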

new_df <- ddply(old_df, .(TIC), mutate, mean_price_3yr = rollmean(price, k = 3, align = "right", na.pad = TRUE))

old_df has the following format:

   fyear TIC ebitda price
1   2000 AIR 64.367 14.00
2   2001 AIR 27.207 11.44
3   2002 AIR 30.745  4.50
4   2003 AIR 47.491  9.58
...
   fyear   TIC  ebitda price
21  2005  ADCT 159.000 17.450
22  2006  ADCT 140.400 14.310
23  2007  ADCT 167.900 18.700
24  2008  ADCT 173.300  6.340
25  2009  ADCT  84.700  8.340
26  2010  ADCT 121.400 12.670
27  2000 ALO.2 190.533 43.875
28  2001 ALO.2 163.601 26.450
29  2002 ALO.2 187.264 11.910
30  2003 ALO.2 155.228 20.100
31  2004 ALO.2 153.829 16.950
...

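The purpose of my ddply call is to compute the rolling mean of price over the last 3 periods, grouped by TIC. Before running the code, I made sure that every TIC has at least 3 observations. There are about 10,000 unique TICs in 80,000 rows total.

With help from another post, I was able to use the ave function instead to accomplish my task: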

old_df$last3 <- ave(old_df$price, old_df$TIC, FUN = function(x) rollmean(x, k = 3, align = "right", na.pad = TRUE))

This code takes about 1 second to run and does the job satisfactorily.
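For readers who want to see what the grouped, right-aligned 3-period mean produces, here is a minimal base R sketch that needs no extra packages. The helper name roll_mean_right is hypothetical (it is not part of zoo or plyr), and the data is only a small subset of the AIR and ADCT rows shown above. It is an illustration under those assumptions, not the code used in the question.

```r
# Right-aligned rolling mean of width k, padded with NA at the start.
# roll_mean_right is a hypothetical helper name, not part of any package.
roll_mean_right <- function(x, k = 3) {
  n <- length(x)
  if (n < k) return(rep(NA_real_, n))
  cs <- cumsum(c(0, x))  # cs[i + 1] holds the sum of x[1:i]
  c(rep(NA_real_, k - 1), (cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]) / k)
}

# Small sample taken from the AIR and ADCT rows shown in the question.
df <- data.frame(
  TIC   = c("AIR", "AIR", "AIR", "AIR", "ADCT", "ADCT", "ADCT"),
  price = c(14, 11.44, 4.5, 9.58, 17.45, 14.31, 18.7)
)

# ave() applies the helper within each TIC and returns values in row order.
df$last3 <- ave(df$price, df$TIC, FUN = roll_mean_right)
print(df)
```

The first two rows of each TIC are NA because fewer than 3 prior observations exist; the third AIR row is (14 + 11.44 + 4.5) / 3 = 9.98.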

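I'm running a MacBook Pro with 16 GB of RAM and a 2.8 GHz Intel Core i7. If anyone can help me diagnose the problem, I would greatly appreciate it!

Update1: Here is a comparison of run times in the actual application. I did not include the ddply result because I didn't want to wait that long :P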

> system.time(test<-epdata %>% group_by(LPERMNO) %>% mutate(mean_price_3yr = roll_mean(ebitda, n=3, align="right", fill=NA)))
   user  system elapsed 
  0.570   0.007   0.577 

> system.time(epdata_2$delta_earnings<-ave(epdata_2$ebitda, epdata_2$LPERMNO, FUN=function(x) Delt(x, k=1, type = "arithmetic")))
   user  system elapsed 
  2.583   0.007   2.600 
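To reproduce timings at roughly the scale described (about 10,000 groups across 80,000 rows) without the original data, a synthetic data frame can be built as in the sketch below. Everything here is an assumption made for illustration: the group labels, the 8 rows per group, the uniform prices, and the roll3 helper built on stats::filter. No expected timing is given, since it depends on the machine.

```r
set.seed(1)  # arbitrary seed so the synthetic data is reproducible

n_groups  <- 10000  # roughly the number of unique TICs in the question
per_group <- 8      # 10,000 * 8 = 80,000 rows

# Synthetic stand-in for old_df; labels and prices are made up.
sim <- data.frame(
  TIC   = rep(sprintf("T%05d", seq_len(n_groups)), each = per_group),
  fyear = rep(2000 + seq_len(per_group) - 1, times = n_groups),
  price = runif(n_groups * per_group, min = 1, max = 100)
)

# Right-aligned 3-period mean; sides = 1 makes the filter use only the
# current and previous values, so the first two results per group are NA.
roll3 <- function(x) as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 1))

timing <- system.time(
  sim$last3 <- ave(sim$price, sim$TIC, FUN = roll3)
)
print(timing)
```

The same sim data frame could then be passed to any of the approaches in this thread so that they are timed on identical input.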

1 answer:

Answer 0: (score: 0)

We can use roll_mean from library(RcppRoll) together with dplyr to make this more efficient:

library(RcppRoll)
library(dplyr)
old_df %>% 
    group_by(TIC) %>%
    mutate(mean_price_3yr = roll_mean(price, n=3,
                  align="right", fill=NA))

Data

old_df <- structure(list(fyear = c(2000L, 2001L, 2002L, 
   2003L, 2005L, 2006L, 
 2007L, 2008L, 2009L, 2010L, 2000L, 2001L, 2002L, 2003L, 
 2004L), TIC = c("AIR", "AIR", "AIR", "AIR", "ADCT", 
"ADCT", "ADCT", 
"ADCT", "ADCT", "ADCT", "ALO.2", "ALO.2", "ALO.2", 
"ALO.2", "ALO.2"
), ebitda = c(64.367, 27.207, 30.745, 47.491, 159, 140.4,
167.9, 173.3, 84.7, 121.4, 190.533, 163.601, 187.264, 
155.228, 153.829), price = c(14, 11.44, 4.5, 9.58, 17.45, 
14.31, 18.7, 6.34, 8.34, 12.67, 43.875, 26.45, 11.91,
20.1, 16.95)), .Names = c("fyear", "TIC", "ebitda", 
"price"), class = "data.frame", row.names = c(NA, -15L))