I have a large data set that looks like this:
set.seed(1234)
id <- c(3,3,3,5,5,7)
amount <- c(24,48,60,84,96,175)
start <- as.Date(c("2006-01-01","2009-12-09","2010-01-01","2006-04-24", "2009-12-09","2009-05-01"))
end <- as.Date(c("2010-01-01","2010-01-01","2010-01-01","2009-12-09","2009-12-09", "2009-05-01"))
noise <- rnorm(6)
test <- data.frame(id,amount,start,end,noise)
id amount start end noise
3 24 2006-01-01 2010-01-01 0.4978505
3 48 2009-12-09 2010-01-01 -1.9666172
3 60 2010-01-01 2010-01-01 0.7013559
5 84 2006-04-24 2009-12-09 -0.4727914
5 96 2009-12-09 2009-12-09 -1.0678237
7 175 2009-05-01 2009-05-01 -0.2179749
But it needs to look like this:
id amount start end noise switch
3 24 2006-01-01 2009-12-09 0.4978505 0
3 48 2009-12-09 2010-01-01 -1.9666172 1
3 60 2010-01-01 2010-01-01 0.7013559 2
5 84 2006-04-24 2009-12-09 -0.4727914 0
5 96 2009-12-09 2009-12-09 -1.0678237 1
7 175 2009-05-01 2009-05-01 -0.2179749 0
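That is, within each id I want to shift the values of start by one row and use them to replace the values of end, so that each row's end is the next row's start. Second, I would like to create a new variable called 'switch' that counts the number of times 'amount' changes within each id, with the first observation equal to 0 to denote the initial condition. I have tried using ts() to create the lag, and although it produces a ts object rather than a Date, it does in principle what I want: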
out <- cbind(as.ts(test$start),lag(test$start))
colnames(out) <- c("start","end")
cbind(as.ts(test$start),lag(test$start))
as.ts(test$start) lag(test$start)
NA 13149
13149 14587
14587 14610
14610 13262
13262 14587
14587 14365
14365 NA
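So the lag(test$start) column is what my result should look like, except that it needs to be applied separately for each id. So I tried wrapping it in a function and applying it over the id variable: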
# make it a function
lagfun <- function(x) {
  cbind(as.ts(x), lag(x))
}
y <- unlist(tapply(start, id, lagfun))
This is where things get really ugly. Is there a better way to go about this?
Answer 0 (score: 5)
If you put your data in a data.table (testDT, created in the breakdown further down), you can do this in one line:
testDT[ , c("end", "switch") :=
list( c(tail(start, -1), tail(end, 1)), cumsum(c(0, diff(amount) != 0)))
, by=id]
Here it is broken down:
# create your data.table object
library(data.table)
testDT <- data.table(test)
# Replace `end` with the next row's `start` (a one-step lead); the last
# row of each group keeps its own `end`.
# Do this `by=id`.
testDT[, end := c(tail(start, -1), tail(end, 1)), by=id]
# Count the number of times that each amount differs from the
# previous amount value.
# Start this vector at 0, and take the cumulative sum.
# Also do this `by=id`.
testDT[, switch := cumsum(c(0, diff(amount) != 0)), by=id]
# this is the final result.
testDT
id amount start end noise switch
1: 3 24 2006-01-01 2009-12-09 -1.2070657 0
2: 3 48 2009-12-09 2010-01-01 0.2774292 1
3: 3 60 2010-01-01 2010-01-01 1.0844412 2
4: 5 84 2006-04-24 2009-12-09 -2.3456977 0
5: 5 96 2009-12-09 2009-12-09 0.4291247 1
6: 7 175 2009-05-01 2009-05-01 0.5060559 0
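For comparison, the same two steps can be sketched in base R without data.table. This is only a minimal illustration, not part of the answer above: it assumes the rows are already sorted by start within each id, it leaves out the noise column because it plays no role in the result, and the helper variable idx is introduced here purely for the example.

```r
# Rebuild the example data (noise omitted; it is not needed for the result)
test <- data.frame(
  id     = c(3, 3, 3, 5, 5, 7),
  amount = c(24, 48, 60, 84, 96, 175),
  start  = as.Date(c("2006-01-01", "2009-12-09", "2010-01-01",
                     "2006-04-24", "2009-12-09", "2009-05-01")),
  end    = as.Date(c("2010-01-01", "2010-01-01", "2010-01-01",
                     "2009-12-09", "2009-12-09", "2009-05-01"))
)

# Create the new column up front so it can be filled group by group
test$switch <- 0

# idx (hypothetical helper): row numbers of `test`, split by id
idx <- split(seq_len(nrow(test)), test$id)

for (i in idx) {
  n <- length(i)
  # Each row's end becomes the next row's start within the same id;
  # the last row of the group keeps its own end.
  test$end[i] <- c(test$start[i][-1], test$end[i][n])
  # diff() != 0 flags a change from the previous amount; prepending 0 and
  # taking the cumulative sum gives a running count that starts at 0.
  test$switch[i] <- cumsum(c(0, diff(test$amount[i]) != 0))
}

test
```

The loop body uses the same expressions as the data.table version; the grouping that `by=id` performs is simply written out by hand with split(). On a large data set the data.table version is likely to be the faster of the two.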