How to speed up a simulation when working with times and dates

Time: 2015-01-18 13:11:11

Tags: r

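I am trying to develop a forecasting method in R. The predictors are weather variables plus lagged measurements of power. The data has one-minute resolution, and the corresponding date and time are available. To add lagged power values to the list of predictors, I want to use the values from the last ten minutes. If I were sure that the database was perfect, with a value available for every minute of the year, I could simply shift the Power column. Since that may not always be the case, I use the code below to check, for each time point, whether all the corresponding lagged values are available, extract them, and store them in a matrix. The problem is that this process is very time-consuming and the simulation takes a long time to run. A reproducible example is given below. I tried changing as.POSIXlt to as.POSIXct; the run time improved, but not by much. I also tried converting the data.frame to a matrix (using as.matrix), but that converts the POSIXct values to class "character". I wonder if anyone can suggest a better approach. Thanks.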

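rm(list = ls())
cat("\014")
library(abind)   # provides abind()
library(pracma)  # assumed source of size(); the question does not say which package it uses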

st="2012/01/01"
et="2012/02/27"

st <- as.POSIXlt(as.Date(st))
et <- as.POSIXlt(as.Date(et))
time= seq(from=st, to=et,by=60)
time <- as.POSIXlt(time)
#Window is the number of lag values
#leadTime is look-ahead time (forecast horizon)
leadTime=10;
Window=15;

zzzz=time[1:8000]
Total_Zone1=abind(matrix(rnorm(4000*2),4000*2,1), matrix(rnorm(4000*2),4000*2,1), matrix(rnorm(4000*2),4000*2,1),time[1:8000])
N_Train=nrow(Total_Zone1);
lag_Power=matrix(0,N_Train,Window)
colnames(Total_Zone1) <- c( "airtemp","humidity",  "Power", "time")
Total_Zone1<- as.data.frame(Total_Zone1)
for (tt in 4000:N_Train){
  Statlag=Total_Zone1$time[tt]-(leadTime+Window)*60
  EndLag=Total_Zone1$time[tt]-(leadTime)*60
  Index_lags=which((Total_Zone1$time>Statlag)&(Total_Zone1$time<=EndLag))
  if (size(Index_lags)[2]<Window) {
    Statlag2=Total_Zone1$time[tt]-24*60*60
    Index_lags2=which(Total_Zone1$time==Statlag2)
    tem1=rep(Total_Zone1[Index_lags2,c("Power")],Window-size(Index_lags)[2])
    lag_Power[tt,]=t(c(Total_Zone1[Index_lags,c("Power")],tem1))
  }else{
     lag_Power[tt,]=t(Total_Zone1[Index_lags,c("Power")])
  }
}

2 answers:

Answer 0 (score: 1)

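The first thing I would suggest is to separate the task of filling in missing data from the task of reformatting the data. Your for loop combines the two, but I think separating them would both simplify and speed up the code.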

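To demonstrate, I'll use my own simplified example code and data, which are similar to yours. The demo below covers one hour of time points, from 2012-01-01 09:00 to 2012-01-01 09:59. I've generated complete data for the first 20 minutes and simulated missing data for the remaining 40 minutes: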

timeFirst20Min <- seq(as.POSIXct('2012-01-01 09:00'),as.POSIXct('2012-01-01 09:19'),60);
timeNextScattered10 <- sort(sample(seq(as.POSIXct('2012-01-01 09:20'),as.POSIXct('2012-01-01 09:59'),60),10));
timeRaw <- c(timeFirst20Min,timeNextScattered10);

raw <- data.frame(airtemp=rnorm(30), humidity=rnorm(30), power=rnorm(30), time=timeRaw );

timeFull <- seq(as.POSIXct('2012-01-01 09:00'),as.POSIXct('2012-01-01 09:59'),60);

clean <- merge(raw, data.frame(time=timeFull), all=T );

clean;

This produces the following output:

                  time     airtemp    humidity       power
1  2012-01-01 09:00:00 -1.84406568 -1.51751811 -0.09391613
2  2012-01-01 09:01:00 -0.58830616 -0.02222524 -0.07865711
3  2012-01-01 09:02:00  1.63189916 -1.35274437 -1.32762005
4  2012-01-01 09:03:00 -1.96424678  0.07326323 -0.29004060
5  2012-01-01 09:04:00 -0.60856223  1.15876062  0.29804843
6  2012-01-01 09:05:00 -0.11504175 -0.43092259  0.49660688
7  2012-01-01 09:06:00  0.24671752  0.58974028 -1.08949827
8  2012-01-01 09:07:00  0.88375964  0.35821875 -0.94881659
9  2012-01-01 09:08:00 -0.70284907 -2.39016066  0.03804497
10 2012-01-01 09:09:00  1.34738493  0.64676077  0.77714355
11 2012-01-01 09:10:00  0.32271920  1.83127669 -1.44754595
12 2012-01-01 09:11:00 -0.46106830 -1.82204149  1.28886422
13 2012-01-01 09:12:00  2.27484699 -1.29816051  0.67963614
14 2012-01-01 09:13:00 -0.25799515 -0.23864259  0.33414247
15 2012-01-01 09:14:00  0.27535436  1.15491712  1.25857697
16 2012-01-01 09:15:00  0.09447671  0.72562227  0.35426141
17 2012-01-01 09:16:00  2.46141564  0.10448464 -0.04715810
18 2012-01-01 09:17:00  0.32952643 -0.51843045 -0.44380916
19 2012-01-01 09:18:00  0.34571678  1.74200137  2.05858658
20 2012-01-01 09:19:00  0.62863598 -0.38642998  1.29170691
21 2012-01-01 09:20:00          NA          NA          NA
22 2012-01-01 09:21:00          NA          NA          NA
23 2012-01-01 09:22:00  0.83148733 -0.79160256 -0.98798400
24 2012-01-01 09:23:00          NA          NA          NA
25 2012-01-01 09:24:00          NA          NA          NA
26 2012-01-01 09:25:00          NA          NA          NA
27 2012-01-01 09:26:00 -0.80548958  0.99318673  0.49630420
28 2012-01-01 09:27:00          NA          NA          NA
29 2012-01-01 09:28:00 -1.12821534 -0.64763280  0.28033445
30 2012-01-01 09:29:00          NA          NA          NA
31 2012-01-01 09:30:00  1.77407448  0.21064525  0.97942804
32 2012-01-01 09:31:00          NA          NA          NA
33 2012-01-01 09:32:00 -0.71832822  0.89608884 -0.14460970
34 2012-01-01 09:33:00          NA          NA          NA
35 2012-01-01 09:34:00          NA          NA          NA
36 2012-01-01 09:35:00          NA          NA          NA
37 2012-01-01 09:36:00          NA          NA          NA
38 2012-01-01 09:37:00          NA          NA          NA
39 2012-01-01 09:38:00          NA          NA          NA
40 2012-01-01 09:39:00 -0.17353680  1.69963228  0.10863195
41 2012-01-01 09:40:00  1.03484529  1.69734966  3.00898820
42 2012-01-01 09:41:00          NA          NA          NA
43 2012-01-01 09:42:00          NA          NA          NA
44 2012-01-01 09:43:00 -1.13983790  1.51982921  0.58583242
45 2012-01-01 09:44:00          NA          NA          NA
46 2012-01-01 09:45:00          NA          NA          NA
47 2012-01-01 09:46:00          NA          NA          NA
48 2012-01-01 09:47:00          NA          NA          NA
49 2012-01-01 09:48:00          NA          NA          NA
50 2012-01-01 09:49:00          NA          NA          NA
51 2012-01-01 09:50:00          NA          NA          NA
52 2012-01-01 09:51:00          NA          NA          NA
53 2012-01-01 09:52:00 -0.93153818  0.77736403  2.08962697
54 2012-01-01 09:53:00          NA          NA          NA
55 2012-01-01 09:54:00  2.26031675 -0.49847630 -1.48191078
56 2012-01-01 09:55:00          NA          NA          NA
57 2012-01-01 09:56:00          NA          NA          NA
58 2012-01-01 09:57:00          NA          NA          NA
59 2012-01-01 09:58:00          NA          NA          NA
60 2012-01-01 09:59:00          NA          NA          NA
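
The merge step above can be checked on a smaller scale. The following is only a minimal sketch on hypothetical toy data (the values, the UTC time zone, and the names `missingRows` and `stepSeconds` are my own assumptions, not part of the question): it merges a gappy series onto a complete one-minute grid, lists the time points that were missing, and confirms that every step is now exactly 60 seconds.

```r
# Hypothetical toy series: minutes 0-4 and 7-8 are present, 5-6 are missing.
t0 <- as.POSIXct("2012-01-01 09:00", tz = "UTC")
timeRaw <- t0 + 60 * c(0:4, 7:8)
raw <- data.frame(power = seq_along(timeRaw), time = timeRaw)

# Complete one-minute grid over the same span.
timeFull <- seq(min(timeRaw), max(timeRaw), by = 60)
clean <- merge(raw, data.frame(time = timeFull), all = TRUE)

# Rows that exist only because of the grid carry NA in the data columns.
missingRows <- which(is.na(clean$power))
clean$time[missingRows]

# After the merge, consecutive rows are exactly 60 seconds apart.
stepSeconds <- diff(as.numeric(clean$time))
```

Once the grid is complete, row position and elapsed minutes are interchangeable, which is what makes the later steps cheap.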

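So the first task is to interpolate the missing power values in clean. Now, in your code, you seem to be looking back a full 24 hours to get the power value that existed at that time before the base time point (if it exists; it isn't guaranteed to exist, is it?), and replicating it onto the end of the vector of power measurements taken during the window. Here is the code I'm referring to: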

Statlag2=Total_Zone1$time[tt]-24*60*60
Index_lags2=which(Total_Zone1$time==Statlag2)
tem1=rep(Total_Zone1[Index_lags2,c("Power")],Window-size(Index_lags)[2])
lag_Power[tt,]=t(c(Total_Zone1[Index_lags,c("Power")],tem1))

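I'm not sure why you look back 24 hours, especially given that the window is only 15 minutes long and ends just 10 minutes before the base time point. I'm also not sure why you append the interpolated value (replicated enough times to make up for the shortfall in available power values) to the end of the row stored in lag_Power; shouldn't it go at the start, especially since the interpolated value comes from so long before the window? I may be missing important context here; please comment if I've missed something.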

In any case, what I would do is fill in each missing power value from the nearest available value in time. I found a nice nearest-value interpolation function that works directly on a single vector at Replacing NAs in R with nearest value. Here it is, reformatted by me:

interpolateMissing <- function(dat) {

    N <- length(dat);
    missing <- which(is.na(dat));
    # nothing to do if no values or all values are missing
    if (length(missing) %in% c(0,N)) return(dat);

    # for each missing position, locate the surrounding non-missing positions
    notMissing <- which(!is.na(dat));
    intervals <- findInterval(missing, notMissing, all.inside=TRUE );

    leftPos <- notMissing[pmax(1,intervals)];
    rightPos <- notMissing[pmin(N,intervals+1)];
    leftDist <- missing - leftPos;
    rightDist <- rightPos - missing;

    # take the closer neighbour; ties go to the left (earlier) value
    dat[missing] <- ifelse(leftDist <= rightDist, dat[leftPos], dat[rightPos] );

    return(dat);

};

You can use this function to easily fill in all missing power values from the nearest available value:

clean$power <- interpolateMissing(clean$power);

clean now looks like this:

                  time     airtemp    humidity       power
1  2012-01-01 09:00:00 -1.84406568 -1.51751811 -0.09391613
2  2012-01-01 09:01:00 -0.58830616 -0.02222524 -0.07865711
3  2012-01-01 09:02:00  1.63189916 -1.35274437 -1.32762005
4  2012-01-01 09:03:00 -1.96424678  0.07326323 -0.29004060
5  2012-01-01 09:04:00 -0.60856223  1.15876062  0.29804843
6  2012-01-01 09:05:00 -0.11504175 -0.43092259  0.49660688
7  2012-01-01 09:06:00  0.24671752  0.58974028 -1.08949827
8  2012-01-01 09:07:00  0.88375964  0.35821875 -0.94881659
9  2012-01-01 09:08:00 -0.70284907 -2.39016066  0.03804497
10 2012-01-01 09:09:00  1.34738493  0.64676077  0.77714355
11 2012-01-01 09:10:00  0.32271920  1.83127669 -1.44754595
12 2012-01-01 09:11:00 -0.46106830 -1.82204149  1.28886422
13 2012-01-01 09:12:00  2.27484699 -1.29816051  0.67963614
14 2012-01-01 09:13:00 -0.25799515 -0.23864259  0.33414247
15 2012-01-01 09:14:00  0.27535436  1.15491712  1.25857697
16 2012-01-01 09:15:00  0.09447671  0.72562227  0.35426141
17 2012-01-01 09:16:00  2.46141564  0.10448464 -0.04715810
18 2012-01-01 09:17:00  0.32952643 -0.51843045 -0.44380916
19 2012-01-01 09:18:00  0.34571678  1.74200137  2.05858658
20 2012-01-01 09:19:00  0.62863598 -0.38642998  1.29170691
21 2012-01-01 09:20:00          NA          NA  1.29170691
22 2012-01-01 09:21:00          NA          NA -0.98798400
23 2012-01-01 09:22:00  0.83148733 -0.79160256 -0.98798400
24 2012-01-01 09:23:00          NA          NA -0.98798400
25 2012-01-01 09:24:00          NA          NA -0.98798400
26 2012-01-01 09:25:00          NA          NA  0.49630420
27 2012-01-01 09:26:00 -0.80548958  0.99318673  0.49630420
28 2012-01-01 09:27:00          NA          NA  0.49630420
29 2012-01-01 09:28:00 -1.12821534 -0.64763280  0.28033445
30 2012-01-01 09:29:00          NA          NA  0.28033445
31 2012-01-01 09:30:00  1.77407448  0.21064525  0.97942804
32 2012-01-01 09:31:00          NA          NA  0.97942804
33 2012-01-01 09:32:00 -0.71832822  0.89608884 -0.14460970
34 2012-01-01 09:33:00          NA          NA -0.14460970
35 2012-01-01 09:34:00          NA          NA -0.14460970
36 2012-01-01 09:35:00          NA          NA -0.14460970
37 2012-01-01 09:36:00          NA          NA  0.10863195
38 2012-01-01 09:37:00          NA          NA  0.10863195
39 2012-01-01 09:38:00          NA          NA  0.10863195
40 2012-01-01 09:39:00 -0.17353680  1.69963228  0.10863195
41 2012-01-01 09:40:00  1.03484529  1.69734966  3.00898820
42 2012-01-01 09:41:00          NA          NA  3.00898820
43 2012-01-01 09:42:00          NA          NA  0.58583242
44 2012-01-01 09:43:00 -1.13983790  1.51982921  0.58583242
45 2012-01-01 09:44:00          NA          NA  0.58583242
46 2012-01-01 09:45:00          NA          NA  0.58583242
47 2012-01-01 09:46:00          NA          NA  0.58583242
48 2012-01-01 09:47:00          NA          NA  0.58583242
49 2012-01-01 09:48:00          NA          NA  2.08962697
50 2012-01-01 09:49:00          NA          NA  2.08962697
51 2012-01-01 09:50:00          NA          NA  2.08962697
52 2012-01-01 09:51:00          NA          NA  2.08962697
53 2012-01-01 09:52:00 -0.93153818  0.77736403  2.08962697
54 2012-01-01 09:53:00          NA          NA  2.08962697
55 2012-01-01 09:54:00  2.26031675 -0.49847630 -1.48191078
56 2012-01-01 09:55:00          NA          NA -1.48191078
57 2012-01-01 09:56:00          NA          NA -1.48191078
58 2012-01-01 09:57:00          NA          NA -1.48191078
59 2012-01-01 09:58:00          NA          NA -1.48191078
60 2012-01-01 09:59:00          NA          NA -1.48191078

So the power column is now ready for use, with no need to worry about missing values.

The last point is that I'm not sure what the benefit of building the lag_Power matrix is. Every window overlaps with its neighbours, so the matrix will contain a significant amount of duplicated data. Instead, I would look at how that data is used downstream of this preparatory code and see whether it can be changed to access the window-specific values directly in clean (or Total_Zone1 in your code). That would save both processing time and memory.
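
The last point can be sketched roughly as follows. Once the series sits on a complete one-minute grid, the lag window for a given row is just a contiguous block of row indices, so no timestamp comparison with which() is needed. This is only an illustration under assumptions: `power` stands in for a gap-free column such as clean$power, the names `lagWindow`, `first`, `windows`, and `lagMatrix` are hypothetical, and the window convention (Window values ending leadTime minutes before the base time) is my reading of the loop in the question.

```r
# Assumption: `power` is a gap-free series with one value per minute
# (for example clean$power after interpolation). Toy data is used here.
power <- as.numeric(1:60)
leadTime <- 10   # forecast horizon in minutes, as in the question
Window <- 15     # number of lagged values, as in the question

# Hypothetical helper: row indices of the lag window for base row tt.
# Mirrors the question's condition
#   time > t - (leadTime + Window) minutes  and  time <= t - leadTime minutes.
lagWindow <- function(tt) (tt - leadTime - Window + 1):(tt - leadTime)

first <- leadTime + Window   # earliest base row that has a full window
power[lagWindow(first)]      # the values in rows 1 to 15

# If a full matrix is still wanted, embed() builds it without a loop.
# Each row of embed(x, Window) holds one window in reverse time order,
# so the columns are flipped to run from oldest to newest.
N <- length(power)
windows <- embed(power, Window)[, Window:1, drop = FALSE]
lagMatrix <- matrix(NA_real_, N, Window)
lagMatrix[first:N, ] <- windows[seq_len(N - first + 1), , drop = FALSE]
```

Rows before `first` stay NA because they have no complete window; how to treat them (drop them, or fill them some other way) depends on the downstream model.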

Answer 1 (score: 0)

A rolling merge from the data.table package should work well here:

#Bring in data.table package
library(data.table)
setDT(Total_Zone1) #Convert data to data.table
leadTime=10
Window=15
Total_Zone1[, time2 := time-60*(leadTime)]

#Create lag table
Lag_Table <- Total_Zone1[,.(time,Lag.Power=Power)]

#Set keys and perform rolling merge
setkey(Total_Zone1,time2)
setkey(Lag_Table,time)

#Perform rolling merge in data.table
Result <- Lag_Table[Total_Zone1,roll=+(Window*60)]

Result:

Result[1:15]

          time   Lag.Power    airtemp   humidity       Power     i.time
 1: 1325375400          NA  0.5341823  0.6114278  0.01378526 1325376000
 2: 1325375460          NA -0.4063909  1.0928144 -0.63712941 1325376060
 3: 1325375520          NA -1.3841581 -0.6204582  0.80553870 1325376120
 4: 1325375580          NA  1.5146208 -0.4343324 -1.29370219 1325376180
 5: 1325375640          NA -1.4949108  0.1416236 -0.64836292 1325376240
 6: 1325375700          NA -1.7832364 -1.2018948 -1.13257989 1325376300
 7: 1325375760          NA -1.2111490  0.5819049  0.22450291 1325376360
 8: 1325375820          NA -0.2837945  0.3196630  0.43199908 1325376420
 9: 1325375880          NA -1.2773624  0.3724754 -1.31807585 1325376480
10: 1325375940          NA  0.7710507 -1.0114118 -0.85951358 1325376540
11: 1325376000  0.01378526  0.7983597  1.7683065  0.38032568 1325376600
12: 1325376060 -0.63712941  0.3813563 -0.3905894  1.47871752 1325376660
13: 1325376120  0.80553870  0.7421624 -1.2265799  0.34509387 1325376720
14: 1325376180 -1.29370219  0.7768929 -1.2244243 -0.02769379 1325376780
15: 1325376240 -0.64836292 -0.8311352  0.8486815  0.16926006 1325376840
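
To see what the roll argument does when timestamps are missing, here is a small sketch on hypothetical toy data (the tables `lagTable` and `lookup` and their values are made up for illustration, and the limit of 90 seconds is arbitrary). With a positive finite roll, a lookup time without an exact match takes the previous observation, but only if that observation is no older than the limit; otherwise the result is NA.

```r
library(data.table)

# Toy lag table with a gap: observations at 0, 60, 120 and 300 seconds.
lagTable <- data.table(time = c(0, 60, 120, 300),
                       Lag.Power = c(1.0, 2.0, 3.0, 4.0))

# Lookup times; 180 and 240 fall inside the gap.
lookup <- data.table(time = c(60, 180, 240, 300))

setkey(lagTable, time)
setkey(lookup, time)

# roll = 90: if there is no exact match, carry the previous observation
# forward, but only when it is at most 90 seconds old.
res <- lagTable[lookup, roll = 90]
res$Lag.Power
# 60  -> exact match                      -> 2
# 180 -> previous value is 60 s old       -> 3
# 240 -> previous value is 120 s old      -> NA
# 300 -> exact match                      -> 4
```

Note that a join like this returns one lagged value per row; getting all Window lagged values would need one such join per lag, or index arithmetic on a complete grid.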