Data too big to aggregate with dplyr?

Asked: 2015-04-08 18:45:12

Tags: r dplyr

I have a huge data frame of roughly 2M rows. The structure looks like this:

head(df)
         TimeStamp Price Contracts
1 9/29/2014 9:00:03 AM 16.05        10
2 9/29/2014 9:00:03 AM 16.04         1
3 9/29/2014 9:00:06 AM 16.05         2
4 9/29/2014 9:00:07 AM 16.05        11
5 9/29/2014 9:00:07 AM 16.05         1
6 9/29/2014 9:00:07 AM 16.05         1

As you can see, several timestamps have multiple entries. I am trying to aggregate this df by TimeStamp using a weighted mean. I am using the dplyr package; the code I'm running and its result are below:

as.data.frame(df %>% group_by(TimeStamp) %>% summarise(VWAP = weighted.mean(Price, Contracts)))
              TimeStamp     VWAP
1 9/29/2014 10:00:08 AM 16.09000
2 9/29/2014 10:00:11 AM 16.09000
3 9/29/2014 10:00:17 AM 16.10012
4 9/29/2014 10:00:18 AM 16.12500
5 9/29/2014 10:00:20 AM 16.15000
6 9/29/2014 10:00:22 AM 16.10000

Say WHAAA??? What happened to the 9AM data?!?! The result seems to skip the first chunk of the data frame! So I took just the first few thousand rows and re-ran the aggregation, with this result:

as.data.frame(head(df, 1000) %>% group_by(TimeStamp) %>% summarise(VWAP = weighted.mean(Price, Contracts)))
             TimeStamp     VWAP
1 9/29/2014 9:00:03 AM 16.04909
2 9/29/2014 9:00:06 AM 16.05000
3 9/29/2014 9:00:07 AM 16.05000
4 9/29/2014 9:00:08 AM 16.05500
5 9/29/2014 9:00:13 AM 16.06000
6 9/29/2014 9:00:20 AM 16.05000

They're there!! What's the deal here?!?! So I kept playing with the size of df. I found that at around 3,000 rows, the returned df seems to start writing over itself:

tester1 <- as.data.frame(head(df, 3100) %>% group_by(TimeStamp) %>% summarise(VWAP = weighted.mean(Price, Contracts)))
head(tester1)
              TimeStamp     VWAP
1 9/29/2014 10:00:08 AM 16.09000
2 9/29/2014 10:00:11 AM 16.09000
3 9/29/2014 10:00:17 AM 16.10012
4  9/29/2014 9:00:03 AM 16.04909
5  9/29/2014 9:00:06 AM 16.05000
6  9/29/2014 9:00:07 AM 16.05000

So it looks like there is some kind of memory limitation in dplyr? What's going on there? I looked through the documentation and couldn't find anything...

2 Answers:

Answer 0 (score: 2)

OK, thanks to David, I figured it out. The df was being sorted as if the timestamps were characters (which they were). So the df was fine; I used the following to sort it, and it now comes out as I expected. (Feel free to suggest any better or more elegant methods...)

# parse the character timestamps into date-times so they sort chronologically
df$TimeStamp <- strptime(df$TimeStamp, format = "%m/%d/%Y %I:%M:%S %p")
df <- df[order(df$TimeStamp), ]
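To see why this fixes things, here is a minimal sketch with invented data (the values below are made up for illustration). As characters, "10:00" sorts before "9:00" because "1" < "9" lexically, so the groups were never missing, just reordered; parsing to date-times restores chronological order:

```r
# Toy data: character timestamps spanning 9AM and 10AM (invented values)
df <- data.frame(
  TimeStamp = c("9/29/2014 10:00:08 AM",
                "9/29/2014 9:00:03 AM",
                "9/29/2014 9:00:03 AM"),
  Price     = c(16.09, 16.05, 16.04),
  Contracts = c(5, 10, 1),
  stringsAsFactors = FALSE
)

# Lexical sort puts the 10AM stamp first, because "1" < "9" as characters
sort(unique(df$TimeStamp))

# Parsing to date-times and reordering puts 9AM back in front
df$TimeStamp <- strptime(df$TimeStamp, format = "%m/%d/%Y %I:%M:%S %p")
df <- df[order(df$TimeStamp), ]
```

After the reorder, the 9:00:03 rows come first, so the grouped summary reads in chronological order.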

Answer 1 (score: 1)

Here is a solution using the lubridate and data.table packages.

library(data.table)
library(lubridate)
setDT(df)
# convert timestamp to a POSIXct object to allow for proper sorting by time
df[ , TimeStamp := mdy_hms(TimeStamp)]
# sort by time
setkey(df, TimeStamp)
# perform weighted mean by Timestamp
df[ , .(VWAP = weighted.mean(Price, Contracts)), by = key(df)]
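For completeness, here is the same pipeline as a self-contained sketch on toy data (column names match the question; the numbers are invented to mirror the question's head(df) output):

```r
library(data.table)
library(lubridate)

# Invented sample mirroring the question's data
df <- data.table(
  TimeStamp = c("9/29/2014 9:00:03 AM",
                "9/29/2014 9:00:03 AM",
                "9/29/2014 10:00:08 AM"),
  Price     = c(16.05, 16.04, 16.09),
  Contracts = c(10, 1, 5)
)

# convert to POSIXct so sorting is chronological, not lexical
df[, TimeStamp := mdy_hms(TimeStamp)]
# sort (and index) by time
setkey(df, TimeStamp)
# weighted mean per timestamp
vwap <- df[, .(VWAP = weighted.mean(Price, Contracts)), by = key(df)]
# first group is the 9:00:03 trades: (16.05*10 + 16.04*1) / 11 = 16.04909
```

Because setkey sorts the table by the POSIXct key, the 9AM group now comes out first, matching the VWAP of 16.04909 the asker saw for 9:00:03.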