R-根据过去的观察结果创建(更改)新列

时间:2018-11-19 13:39:15

标签: r dataframe calculated-columns mutate

好的,所以我有一个非常大的数据集,包含大约500个观察值和3个变量。第一列是时间。

对于我正在使用的测试数据集:

dat=as.data.frame(matrix(c(1,2,3,4,5,6,7,8,9,10,
        1,1.8,3.5,3.8,5.6,6.2,7.8,8.2,9.8,10.1,
        2,4.8,6.5,8.8,10.6,12.2,14.8,16.2,18.8,20.1),10,3))
colnames(dat)=c("Time","Var1","Var2")


   Time Var1 Var2
1     1  1.0  2.0
2     2  1.8  4.8
3     3  3.5  6.5
4     4  3.8  8.8
5     5  5.6 10.6
6     6  6.2 12.2
7     7  7.8 14.8
8     8  8.2 16.2
9     9  9.8 18.8
10   10 10.1 20.1

所以我需要做的是创建一个新列,每个观察值都是一些过去点相对于时间的斜率。例如,拿过去的3点,就像:

slopeVar1[i]=slope(Var1[i-2:i],Time[i-2:i]) #Not real code
slopeVar[i]=slope(Var2[i-2:i],Time[i-2:i])  #Not real code

    Time    Var1    Var2    slopeVar1   slopeVar2
1   1       1       2       NA          NA
2   2       1.8     4.8     NA          NA
3   3       3.5     6.5     1.25        2.25
4   4       3.8     8.8     1.00        2.00
5   5       5.6     10.6    1.05        2.05
6   6       6.2     12.2    1.20        1.70
7   7       7.8     14.8    1.10        2.10
8   8       8.2     16.2    1.00        2.00
9   9       9.8     18.8    1.00        2.00
10  10      10.1    20.1    0.95        1.95

我实际上可以使用for()函数,但是对于非常大的数据集(> 100,000),它开始花费的时间太长。

下面显示了我使用的for()参数:

#CREATE DATA FRAME
rm(dat)
  dat=as.data.frame(matrix(c(1,2,3,4,5,6,7,8,9,10,
              1,1.8,3.333,3.8,5.6,6.2,7.8,8.2,9.8,10.1,
              2,4.8,6.5,8.8,10.6,12.2,14.8,16.2,18.8,20.1),10,3))
  colnames(dat)=c("Time","Var1","Var2")
  dat
  plot(dat)

#CALCULATE SLOPE OF n POINTS FROM i TO i-n.
#In this case I am taking just 3 points, but it should 
#be possible to change the number of points taken. 

attach(dat)
n=3 #number for points to take slope
l=dim(dat[1])[1] #number of iterations
y=0
x=0
slopeVar1=NA
slopeVar2=NA
for (i in 1:l) {
    if   (i<n) {slopeVar1[i]=NA} #For the rows where there are not enough previous observations, it outputs NA
    if   (i>=n) {
      y1=Var1[(i-n+1):i] #y data sets for calculating slope of Var1
      y2=Var2[(i-n+1):i]#y data sets for calculating slope of Var2
      x=Time[(i-n+1):i] #x data sets for calculating slope of Var1&Var2

          z1=lm(y1~x) #Temporal value of slope of Var1
          z2=lm(y2~x) #Temporal value of slope of Var2
          slope1=as.data.frame(z1[1]) #Temporal value of slope of Var1
          slopeVar1[i]=slope1[2,1] #Populating string of slopeVar1
          slope2=as.data.frame(z2[1])#Temporal value of slope of Var2
          slopeVar2[i]=slope2[2,1] #Populating string of slopeVar2
          }
 }
slopeVar1 #Checking results. 
slopeVar2

(result=cbind(dat,slopeVar1,slopeVar2)) #Binds original data with new calculated slopes. 

这段代码实际上输出了我想要的;但是同样,对于非常大的数据集,效率很低。

1 个答案:

答案 0 :(得分:1)

这种rollapply的快速实现似乎正在加快速度-

library("zoo")
slope_func = function(period) { 
  y1=period[,2] #y data sets for calculating slope of Var1
  y2=period[,3] #y data sets for calculating slope of Var2
  x=period[,1] #x data sets for calculating slope of Var1&Var2
  z1=lm(y1~x) #Temporal value of slope of Var1
  z2=lm(y2~x) #Temporal value of slope of Var2
  slope1=as.data.frame(z1[1]) #Temporal value of slope of Var1
  slopeVar1[i]=slope1[2,1] #Populating string of slopeVar1
  slope2=as.data.frame(z1[1])#Temporal value of slope of Var2
  slopeVar2[i]=slope2[2,1] #Populating string of slopeVar2
  }
}

start = Sys.time()
rollapply(dat[1:3], FUN=slope_func, width=3, by.column=FALSE)
end=Sys.time()
print(end-start)

Time difference of 0.04980111 secs

OP以前的实现是将Time difference of 0.2666121 secs用作相同的