Question

I have a data file containing readings sampled every 30 seconds. The file is organized as follows:

> head(dframe)
            timestamp    power
1 2015-08-01 00:00:04 584.1379
2 2015-08-01 00:00:34 585.8087
3 2015-08-01 00:01:04 584.9335
4 2015-08-01 00:01:34 584.4366
5 2015-08-01 00:02:04 584.2829

To represent the 30-second data as hourly means, I use the following R command:

df = aggregate(list(power=dframe$power),by=list(timestamp=cut(as.POSIXct(dframe$timestamp),"hour")),mean)

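This works perfectly. The actual problem, however, is the time it takes on large files (a year of data). Can I somehow reduce the time the conversion takes? In other words, is there a better alternative that converts seconds data to hourly mean data in R in less time?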

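Update: I used four different methods to solve the same problem, as suggested by @akrun and @Joshua. For other Stack Overflow users, I am providing the usage of all four methods and the corresponding timings.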

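library(data.table)  # setDT
library(dplyr)       # group_by, summarise, %>%
library(xts)         # xts, period.apply, endpoints
# path is the location of the CSV file
dframe<-read.csv(path,head=TRUE,sep=",")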
dframe$timestamp<- as.POSIXct(dframe$timestamp)
xframe = dframe
#using aggregate
system.time(
df1<- aggregate(list(power=dframe$power),by=list(timestamp=cut(dframe$timestamp,"hour")),mean)
)
# using data.table
system.time(
dfx<-setDT(dframe)[, list(power= mean(power)), by = .(timestamp= cut(timestamp, 'hour'))]
)
# using dplyr
system.time( 
xframe %>% group_by(timestamp= cut(timestamp, 'hour')) %>% summarise(power=mean(power))
)
#using xts
system.time({
  x <- xts(dframe$power,dframe$timestamp)
  h <- period.apply(x, endpoints(x, "hours"), mean)
  h <- data.frame(timestamp=trunc(index(h),'hours'), power=coredata(h))
})

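The corresponding timings (in seconds) for two datasets of different sizes (one month and three months) are as follows. One-month dataset: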

Method       user  system elapsed 
Aggregate    0.137   0.005   0.142
data.table   0.031   0.001   0.032 
dplyr        0.035   0.001   0.036  
xts          0.053   0.000   0.053

Three-month dataset:

Method       user  system elapsed 
Aggregate    0.456   0.019   0.475 
data.table   0.099   0.002   0.102  
dplyr        0.099   0.004   0.103  
xts          0.158   0.004   0.161
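
Single system.time() runs at this scale (tens of milliseconds) can be noisy, so it may help to repeat each measurement and compare medians. The following is a minimal sketch using base R only. The helper median_elapsed and the synthetic data are hypothetical and not part of the original benchmark, so the numbers it produces are not comparable to the tables above.

```r
# Hypothetical helper (an assumption, not from the original post): call a
# function several times and return the median elapsed time in seconds.
median_elapsed <- function(f, times = 5) {
  elapsed <- vapply(seq_len(times),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
  median(elapsed)
}

# Synthetic 30-second readings (made-up values, for illustration only)
set.seed(21)
N <- 1e5
dframe <- data.frame(
  timestamp = seq(as.POSIXct("2015-08-01", tz = "UTC"),
                  by = "30 sec", length.out = N),
  power = rnorm(N)
)

# Median timing of the aggregate() approach from the question
t_aggregate <- median_elapsed(function() {
  aggregate(list(power = dframe$power),
            by = list(timestamp = cut(dframe$timestamp, "hour")),
            mean)
})
t_aggregate
```

The other three methods can be timed the same way by wrapping each call in function() { ... }.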

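Caveat: all methods except xts change the timestamp type from POSIXct to factor. This means you have to convert the timestamp column back afterwards, which costs additional CPU cycles. In short, if you need POSIXct timestamps in the end, xts is the best choice; otherwise go with data.table.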
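
The factor conversion comes from cut(), not from the grouping functions themselves. One possible way around it is to group on a POSIXct hour floor built with trunc(). The sketch below shows this with base R's aggregate() on made-up data; it was not part of the timings above, and the same grouping expression could presumably be used with data.table or dplyr as well.

```r
# Synthetic data (an assumption for illustration): 4 hours of 30-second readings
set.seed(1)
dframe <- data.frame(
  timestamp = seq(as.POSIXct("2015-08-01 00:00:04", tz = "UTC"),
                  by = "30 sec", length.out = 480),
  power = rnorm(480, mean = 585)
)

# trunc() drops minutes and seconds and returns POSIXlt;
# as.POSIXct() converts it back so the grouping key stays POSIXct
hour_floor <- as.POSIXct(trunc(dframe$timestamp, "hours"))

df <- aggregate(list(power = dframe$power),
                by = list(timestamp = hour_floor),
                mean)

class(df$timestamp)  # check that the column is still POSIXct, not factor
```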

DATASET: The dataset used can be found at link

Answer 1

Using the tools in the xts package, you can do this aggregation in less than half the time.

library(xts)
# sample data
set.seed(21)
N <- 2e6
dframe <- data.frame(timestamp=seq(Sys.time(), by="30 sec", length.out=N),
                     power=rnorm(N))
# aggregate
system.time(a <- aggregate(list(power=dframe$power),by=list(timestamp=cut(dframe$timestamp,"hour")), mean))
#    user  system elapsed 
#   2.456   0.000   2.457 

# xts
system.time({
  x <- xts(dframe$power, dframe$timestamp)
  h <- period.apply(x, endpoints(x, "hours"), mean)
  h <- data.frame(timestamp=trunc(index(h),'hours'), power=coredata(h))
})
#    user  system elapsed 
#   0.888   0.004   0.893

Answer 2

In general, aggregate is slow. We can use data.table to do this quickly. Convert the 'data.frame' to a 'data.table' (setDT(dframe)), use cut to create a grouping variable from 'timestamp', and get the mean of 'power'.

library(data.table)
setDT(dframe)[, list(power= mean(power)), by = .(timestamp= cut(as.POSIXct(timestamp), 'hour'))]

Convert seconds-interval data to an hourly mean representation in less time
