Question

Suppose I have the following data frame:

df <- data.frame("yearmonth"=c("2005-01","2005-02","2005-03","2005-01","2005-02","2005-03"),"state"=c(1,1,1,2,2,2),"county"=c(3,3,3,3,3,3),"unemp"=c(4.0,3.6,1.4,3.7,6.5,5.4))

I am trying to create a lag of unemployment within each unique state-county combination. I want to end up with this:

df2 <- data.frame("yearmonth"=c("2005-01","2005-02","2005-03","2005-01","2005-02","2005-03"),"state"=c(1,1,1,2,2,2),"county"=c(3,3,3,3,3,3),"unemp"=c(4.0,3.6,1.4,3.7,6.5,5.4),"unemp_lag"=c(NA,4.0,3.6,NA,3.7,6.5))

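Now imagine this situation, except with thousands of different state-county combinations and several years. I tried the lag function and the zoo.lag function, but I couldn't get either to take the state-county codes into account. One possibility is to write a giant for loop, but I think that is too much data (R doesn't handle loops well), and I'm looking for a cleaner way to do this. Any ideas? Thanks!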

Answer 1

Just an old-style base R approach:

# Note: lag() here must be dplyr::lag; stats::lag() does not shift a plain vector.
dsp <- split(df, list(df$state, df$county))
dsp <- lapply(dsp, function(x) transform(x, unemp_lag = dplyr::lag(unemp)))
dsp <- unsplit(dsp, list(df$state, df$county))
dsp
  yearmonth state county unemp unemp_lag
1   2005-01     1      3   4.0        NA
2   2005-02     1      3   3.6       4.0
3   2005-03     1      3   1.4       3.6
4   2005-01     2      3   3.7        NA
5   2005-02     2      3   6.5       3.7
6   2005-03     2      3   5.4       6.5

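Edit

The lag function I used in my solution is dplyr's lag (even though I hadn't realized that before BlondedDust's comment), so the code above is not pure base R. A true, genuinely pure base R solution has to build the lag without it.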
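
A minimal sketch of one pure base R way to do this, using ave() to apply a one-step shift within each state-county group. It assumes the rows are already in time order within each group, and it is not necessarily the code originally posted in this answer:

```r
# Example data from the question
df <- data.frame(
  yearmonth = c("2005-01", "2005-02", "2005-03", "2005-01", "2005-02", "2005-03"),
  state     = c(1, 1, 1, 2, 2, 2),
  county    = c(3, 3, 3, 3, 3, 3),
  unemp     = c(4.0, 3.6, 1.4, 3.7, 6.5, 5.4)
)

# ave() splits unemp by state and county, applies FUN to each piece,
# and puts the results back in the original row positions.
# FUN drops the last value and prepends NA, i.e. a lag of one row.
df$unemp_lag <- ave(df$unemp, df$state, df$county,
                    FUN = function(x) c(NA, head(x, -1)))
df
```

Because ave() writes results back by position, no split/unsplit round trip is needed, and no package has to be loaded.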

Answer 2

Using data.table:

library(data.table)
setDT(df)[,`:=`(unemp_lag1=shift(unemp,n=1L,fill=NA, type="lag")),by=.(state, county)][]

   yearmonth state county unemp unemp_lag1
1:   2005-01     1      3   4.0         NA
2:   2005-02     1      3   3.6        4.0
3:   2005-03     1      3   1.4        3.6
4:   2005-01     2      3   3.7         NA
5:   2005-02     2      3   6.5        3.7
6:   2005-03     2      3   5.4        6.5

Answer 3

Using dplyr:

> library(dplyr)
> df %>% group_by(state, county) %>% mutate(unemp_lag=lag(unemp))
Source: local data frame [6 x 5]
Groups: state, county

   yearmonth state county unemp unemp_lag
1   2005-01     1      3   4.0        NA
2   2005-02     1      3   3.6       4.0
3   2005-03     1      3   1.4       3.6
4   2005-01     2      3   3.7        NA
5   2005-02     2      3   6.5       3.7
6   2005-03     2      3   5.4       6.5

Using data.table:

> df2 <- as.data.table(df)
> df2[, unemp_lag := c(NA , unemp[-.N]), by=list(state, county)]

   yearmonth state county unemp unemp_lag
1:   2005-01     1      3   4.0        NA
2:   2005-02     1      3   3.6       4.0
3:   2005-03     1      3   1.4       3.6
4:   2005-01     2      3   3.7        NA
5:   2005-02     2      3   6.5       3.7
6:   2005-03     2      3   5.4       6.5
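
One caveat: the approaches above shift by row position, so they assume the rows are already sorted by yearmonth within each state-county group. If that is not guaranteed, a sketch like the following sorts first and then lags. It uses base R only, and df_shuffled is a made-up, deliberately out-of-order copy of the example data:

```r
# Hypothetical out-of-order version of the example data
df_shuffled <- data.frame(
  yearmonth = c("2005-03", "2005-01", "2005-02", "2005-02", "2005-03", "2005-01"),
  state     = c(1, 2, 1, 2, 2, 1),
  county    = c(3, 3, 3, 3, 3, 3),
  unemp     = c(1.4, 3.7, 3.6, 6.5, 5.4, 4.0)
)

# "YYYY-MM" strings sort chronologically as plain text,
# so ordering on the character column is enough here.
df_sorted <- df_shuffled[order(df_shuffled$state,
                               df_shuffled$county,
                               df_shuffled$yearmonth), ]

# Lag by one row within each state-county group
df_sorted$unemp_lag <- ave(df_sorted$unemp, df_sorted$state, df_sorted$county,
                           FUN = function(x) c(NA, head(x, -1)))
df_sorted
```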

Lag in R within specific subsets?
