Question

I have a large dataset that looks like this:

set.seed(1234)
id <- c(3,3,3,5,5,7)
amount <- c(24,48,60,84,96,175)
start <- as.Date(c("2006-01-01","2009-12-09","2010-01-01","2006-04-24", "2009-12-09","2009-05-01"))
end <- as.Date(c("2010-01-01","2010-01-01","2010-01-01","2009-12-09","2009-12-09", "2009-05-01"))               
noise <-rnorm(6)
test <- data.frame(id,amount,start,end,noise)            

  id amount      start        end      noise
   3     24 2006-01-01 2010-01-01  0.4978505
   3     48 2009-12-09 2010-01-01 -1.9666172
   3     60 2010-01-01 2010-01-01  0.7013559
   5     84 2006-04-24 2009-12-09 -0.4727914
   5     96 2009-12-09 2009-12-09 -1.0678237
   7    175 2009-05-01 2009-05-01 -0.2179749

But it needs to look like this:

  id amount      start        end      noise   switch
   3     24 2006-01-01 2009-12-09  0.4978505        0
   3     48 2009-12-09 2010-01-01 -1.9666172        1
   3     60 2010-01-01 2010-01-01  0.7013559        2
   5     84 2006-04-24 2009-12-09 -0.4727914        0 
   5     96 2009-12-09 2009-12-09 -1.0678237        1
   7    175 2009-05-01 2009-05-01 -0.2179749        0

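That is, I would like to lag the values of start and use them to replace the values of end, within each id. Secondly, I would like to create a new variable called 'switch' that counts the number of times 'amount' changes within each id, with the first observation set to 0 as the initial condition. I have tried using ts() to create the lag, and although it produces a ts object rather than a Date, in principle it does what I want: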

       out <- cbind(as.ts(test$start),lag(test$start))
       colnames(out) <- c("start","end")
       cbind(as.ts(test$start),lag(test$start))

         as.ts(test$start) lag(test$start)
            NA           13149
          13149           14587
          14587           14610
          14610           13262
          13262           14587
          14587           14365
          14365              NA

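So the lag(test$start) column is what my result should look like, except that it needs to be applied within each id. So I tried wrapping it in a function and applying it over the id variable: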

        #make it a function 
        lagfun <- function(x){
          cbind(as.ts(x),lag(x))
        }

        y <- unlist(tapply(start,id,lagfun))

This is where things get really ugly. Is there a better way to approach this?

Answer 1

If you put your time series in a data.table, you can do this in one line:

testDT[ , c("end", "switch") := 
          list( c(tail(start, -1), tail(end, 1)), cumsum(c(0, diff(amount) != 0)))
      , by=id]
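
To see what the two expressions in that line compute, here is a minimal sketch on the vectors for id 3 alone. The names `start3`, `end3`, `amount3`, `new_end`, and `switch3` are made up for this illustration and do not appear in the original data.

```r
# Values for id == 3, copied from the question's data
start3  <- as.Date(c("2006-01-01", "2009-12-09", "2010-01-01"))
end3    <- as.Date(c("2010-01-01", "2010-01-01", "2010-01-01"))
amount3 <- c(24, 48, 60)

# Drop the first start date, then append the group's last end date,
# so each row's end becomes the next row's start
new_end <- c(tail(start3, -1), tail(end3, 1))

# Flag each change in amount (TRUE counts as 1) and accumulate;
# the leading 0 is the initial condition for the first row
switch3 <- cumsum(c(0, diff(amount3) != 0))

print(new_end)
print(switch3)
```

The `by=id` argument simply repeats this computation separately for each id.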

Here it is broken down:

# create your data.table object 
library(data.table)
testDT <- data.table(test)


# Modify `end` by taking the lag of start and the final date from end. 
#   do this `by=id`
testDT[, end := c(tail(start, -1), tail(end, 1)), by=id]

# Count the number of times that each amount differs from the 
#  previous amount value.  
# Start this vector at 0, and take the cumulative sum. 
#  Also do this by id. 
testDT[, switch := cumsum(c(0, diff(amount) != 0)), by=id]

# this is the final result. 
testDT
   id amount      start        end      noise switch
1:  3     24 2006-01-01 2009-12-09 -1.2070657      0
2:  3     48 2009-12-09 2010-01-01  0.2774292      1
3:  3     60 2010-01-01 2010-01-01  1.0844412      2
4:  5     84 2006-04-24 2009-12-09 -2.3456977      0
5:  5     96 2009-12-09 2009-12-09  0.4291247      1
6:  7    175 2009-05-01 2009-05-01  0.5060559      0
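
If adding a data.table dependency is not an option, the same result can probably be reached in base R. The following is only a sketch, not part of the answer above. It assumes that the rows of each id are contiguous and already sorted by start, and it omits the noise column because it plays no role in the computation. The helper vector `is_last` is a name introduced here for illustration.

```r
# Rebuild the question's data (noise column omitted)
test <- data.frame(
  id     = c(3, 3, 3, 5, 5, 7),
  amount = c(24, 48, 60, 84, 96, 175),
  start  = as.Date(c("2006-01-01", "2009-12-09", "2010-01-01",
                     "2006-04-24", "2009-12-09", "2009-05-01")),
  end    = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01",
                     "2009-12-09", "2009-12-09", "2009-05-01"))
)

# TRUE for the last row of each id; those rows keep their original end
is_last <- !duplicated(test$id, fromLast = TRUE)

# Every other row takes the start date of the row that follows it
# (assumption: rows of the same id are adjacent and ordered by start)
test$end[!is_last] <- test$start[which(!is_last) + 1]

# Per-id running count of changes in amount, starting at 0
test$switch <- ave(test$amount, test$id,
                   FUN = function(a) cumsum(c(0, diff(a) != 0)))

print(test)
```

On a large dataset the data.table version is likely to be faster, so this is mainly useful as a dependency-free cross-check.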

Creating a lag using adjacent columns in a data.frame
