I have a huge data frame of about 2M rows. The structure looks like this:
head(df)
TimeStamp Price Contracts
1 9/29/2014 9:00:03 AM 16.05 10
2 9/29/2014 9:00:03 AM 16.04 1
3 9/29/2014 9:00:06 AM 16.05 2
4 9/29/2014 9:00:07 AM 16.05 11
5 9/29/2014 9:00:07 AM 16.05 1
6 9/29/2014 9:00:07 AM 16.05 1
As you can see, several timestamps have multiple entries. I'm trying to aggregate this df by timestamp using a weighted mean. I'm using the dplyr package; below is the code I'm using and the result:
as.data.frame(df %>% group_by(TimeStamp) %>% summarise(VWAP = weighted.mean(Price, Contracts)))
TimeStamp VWAP
1 9/29/2014 10:00:08 AM 16.09000
2 9/29/2014 10:00:11 AM 16.09000
3 9/29/2014 10:00:17 AM 16.10012
4 9/29/2014 10:00:18 AM 16.12500
5 9/29/2014 10:00:20 AM 16.15000
6 9/29/2014 10:00:22 AM 16.10000
Say WHAAA??? What happened to the 9 AM data?!?! The result seems to skip the first chunk of the data frame! So, I took just the first few thousand rows, re-ran the aggregation, and got this:
as.data.frame(head(df, 1000) %>% group_by(TimeStamp) %>% summarise(VWAP = weighted.mean(Price, Contracts)))
TimeStamp VWAP
1 9/29/2014 9:00:03 AM 16.04909
2 9/29/2014 9:00:06 AM 16.05000
3 9/29/2014 9:00:07 AM 16.05000
4 9/29/2014 9:00:08 AM 16.05500
5 9/29/2014 9:00:13 AM 16.06000
6 9/29/2014 9:00:20 AM 16.05000
There they are!! What's the deal here?!?! So, I kept playing with the size of the df. I found that at around 3000 rows the returned df appears to start writing over itself:
tester1 <- as.data.frame(head(df, 3100) %>% group_by(TimeStamp) %>% summarise(VWAP = weighted.mean(Price, Contracts)))
head(tester1)
TimeStamp VWAP
1 9/29/2014 10:00:08 AM 16.09000
2 9/29/2014 10:00:11 AM 16.09000
3 9/29/2014 10:00:17 AM 16.10012
4 9/29/2014 9:00:03 AM 16.04909
5 9/29/2014 9:00:06 AM 16.05000
6 9/29/2014 9:00:07 AM 16.05000
So it looks like there's some kind of memory limit in dplyr? What's going on here? I looked through the documentation and couldn't find anything...
Answer 0 (score: 2)
OK, thanks to David, I figured it out. The timestamps were being sorted as if they were characters (which they were). So the df was fine; I used the following to sort it and got what I expected. (Feel free to suggest better or more elegant approaches...)
# parse the character timestamps into date-times so ordering is chronological
df$TimeStamp <- strptime(df$TimeStamp, format = "%m/%d/%Y %I:%M:%S %p")
df <- df[order(df$TimeStamp), ]
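For what it's worth, the same fix can be folded into the dplyr pipeline itself. A minimal sketch, assuming the same column names; as.POSIXct is used here instead of strptime, since it returns a type that sits more comfortably in a data frame column:
library(dplyr)
# parse, sort, and aggregate in one pipe
as.data.frame(df %>%
  mutate(TimeStamp = as.POSIXct(TimeStamp, format = "%m/%d/%Y %I:%M:%S %p")) %>%
  arrange(TimeStamp) %>%
  group_by(TimeStamp) %>%
  summarise(VWAP = weighted.mean(Price, Contracts)))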
Answer 1 (score: 1)
Here's a solution using the lubridate and data.table packages.
library(data.table)
library(lubridate)
setDT(df)
# convert timestamp to a POSIXct object to allow for proper sorting by time
df[ , TimeStamp := mdy_hms(TimeStamp)]
# sort by time
setkey(df, TimeStamp)
# perform weighted mean by Timestamp
df[ , .(VWAP = weighted.mean(Price, Contracts)), by = key(df)]
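As a side note, the explicit setkey() call can be folded into the aggregation: data.table's keyby argument sorts by the grouping column and groups in one step, so an equivalent sketch (assuming TimeStamp has already been parsed as above) is:
# keyby sorts by TimeStamp and groups in a single step
df[ , .(VWAP = weighted.mean(Price, Contracts)), keyby = TimeStamp]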