Create a variable from two yearly values of another variable in R

Time: 2020-04-03 02:49:19

Tags: r missing-data panel-data data-transform

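It looks simple, but I could not find an answer online. I have panel data with city characteristics for 1995-2015. For some variables I only have data for 2000 and 2010. So I would like to create new variables that impute the missing values, using the 2000 value for 1995-2004 and the 2010 value for 2005-2015.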

My dataset looks like this example:

   cities  idhm year
1       B    NA 1995
2       C    NA 1996
3       D    NA 1997
4       E    NA 1998
5       F    NA 1999
6       G 24599 2000
7       H    NA 2001
8       I    NA 2002
9       J    NA 2003
10      K    NA 2004
11      L    NA 2005
12      M    NA 2006
13      N    NA 2007
14      O    NA 2008
15      P    NA 2009
16      Q  5598 2010
17      R    NA 2011
18      S    NA 2012
19      T    NA 2013
20      U    NA 2014
21      V    NA 2015

I would like a dataset like this:

   cities  idhm year newvar
1       B    NA 1995  24599
2       C    NA 1996  24599
3       D    NA 1997  24599
4       E    NA 1998  24599
5       F    NA 1999  24599
6       G 24599 2000  24599
7       H    NA 2001  24599
8       I    NA 2002  24599
9       J    NA 2003  24599
10      K    NA 2004  24599
11      L    NA 2005   5598
12      M    NA 2006   5598
13      N    NA 2007   5598
14      O    NA 2008   5598
15      P    NA 2009   5598
16      Q  5598 2010   5598
17      R    NA 2011   5598
18      S    NA 2012   5598
19      T    NA 2013   5598
20      U    NA 2014   5598
21      V    NA 2015   5598

Any help is welcome.

2 answers:

Answer 0: (score: 2)

I suspect your data are larger than this example, so a more general approach is to use a rolling join. I find this easiest with data.table.

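First, build a lookup table (a "dictionary") of the rows where idhm is not missing, to join against.

library(data.table)
setDT(data1)  # convert data1 to a data.table by reference
# keep only the years that have an observed idhm value
dictionary <- data1[!is.na(idhm), .(year, idhm)]
dictionary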
#   year  idhm
#1: 2000 24599
#2: 2010  5598

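Then perform the join with on = "year" and roll = "nearest", so that each year picks up the idhm value of the closest observed year. Note that 2005 is equally distant from 2000 and 2010; in the output below it receives the 2000 value (24599), whereas the question asks for 2005 to take the 2010 value.

result <- dictionary[data1, on = "year", roll = "nearest"]
result[, .(cities, year, idhm)]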
#   cities year  idhm
# 1:      B 1995 24599
# 2:      C 1996 24599
# 3:      D 1997 24599
# 4:      E 1998 24599
# 5:      F 1999 24599
# 6:      G 2000 24599
# 7:      H 2001 24599
# 8:      I 2002 24599
# 9:      J 2003 24599
#10:      K 2004 24599
#11:      L 2005 24599
#12:      M 2006  5598
#13:      N 2007  5598
#14:      O 2008  5598
#15:      P 2009  5598
#16:      Q 2010  5598
#17:      R 2011  5598
#18:      S 2012  5598
#19:      T 2013  5598
#20:      U 2014  5598
#21:      V 2015  5598
#    cities year  idhm

Data

data1 <- structure(list(cities = structure(1:21, .Label = c("B", "C", 
"D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", 
"Q", "R", "S", "T", "U", "V"), class = "factor"), idhm = c(NA, 
NA, NA, NA, NA, 24599L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5598L, 
NA, NA, NA, NA, NA), year = 1995:2015), class = "data.frame", row.names = c(NA, 
-21L))
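
If 2005 has to take the 2010 value, as the question requests, one possible variation is to join on the first year of each period instead of on the observed years, and roll forward. The following is only a minimal sketch under that assumption: the `periods` table, its `newvar` column and the start years 1995 and 2005 are not part of the answer above and are introduced here purely for illustration.

```r
library(data.table)

# Rebuild the example data (same values as data1 above) so the sketch runs alone.
data1 <- data.table(
  cities = LETTERS[2:22],
  idhm   = NA_integer_,
  year   = 1995:2015
)
data1[year == 2000, idhm := 24599L]
data1[year == 2010, idhm := 5598L]

# Hypothetical lookup table: one row per period, keyed by the period's first year.
# newvar holds the idhm values observed in 2000 and 2010, in that order.
periods <- data.table(
  year   = c(1995L, 2005L),
  newvar = data1[year %in% c(2000L, 2010L), idhm]
)

# roll = TRUE matches each year to the last period start at or before it.
result <- periods[data1, on = "year", roll = TRUE]
result[, .(cities, idhm, year, newvar)]
```

With this layout the period boundaries are stated explicitly, so a tie such as 2005 no longer depends on how "nearest" resolves equal distances.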

Answer 1: (score: 1)

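We can do this in base R (df is the data frame from the question):

df$new_var <- NA
# 1995-2004 take the value observed in 2000
df$new_var[df$year >= 1995 & df$year <= 2004] <- df$idhm[df$year == 2000]
# 2005-2015 take the value observed in 2010
df$new_var[df$year >= 2005 & df$year <= 2015] <- df$idhm[df$year == 2010]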
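
When there are more than two reference years, writing one assignment per period becomes repetitive. A possible generalization, sketched here with base R's `findInterval()`, keeps the period boundaries in vectors. The names `period_start` and `ref_year` are illustrative assumptions, and the sketch assumes every year falls at or after the first period start.

```r
# Rebuild the example data so the snippet runs on its own.
df <- data.frame(cities = LETTERS[2:22], idhm = NA_integer_, year = 1995:2015)
df$idhm[df$year == 2000] <- 24599L
df$idhm[df$year == 2010] <- 5598L

# Assumed period definitions (illustrative names, not from the answer).
period_start <- c(1995, 2005)  # first year of each period
ref_year     <- c(2000, 2010)  # year whose idhm value fills that period

# findInterval() gives 1 for 1995-2004 and 2 for 2005-2015.
period <- findInterval(df$year, period_start)

# Look up the idhm value observed in each row's reference year.
df$new_var <- df$idhm[match(ref_year[period], df$year)]
```

Adding a third reference year would then only require appending one element to each of the two vectors.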

Or using dplyr:

library(dplyr)

df %>%
  mutate(new_var = case_when(between(year, 1995, 2004) ~ idhm[year == 2000],
                             between(year, 2005, 2015) ~ idhm[year == 2010]))


#   cities  idhm year new_var
#1       B    NA 1995   24599
#2       C    NA 1996   24599
#3       D    NA 1997   24599
#4       E    NA 1998   24599
#5       F    NA 1999   24599
#6       G 24599 2000   24599
#7       H    NA 2001   24599
#8       I    NA 2002   24599
#9       J    NA 2003   24599
#10      K    NA 2004   24599
#11      L    NA 2005    5598
#12      M    NA 2006    5598
#13      N    NA 2007    5598
#14      O    NA 2008    5598
#15      P    NA 2009    5598
#16      Q  5598 2010    5598
#17      R    NA 2011    5598
#18      S    NA 2012    5598
#19      T    NA 2013    5598
#20      U    NA 2014    5598
#21      V    NA 2015    5598
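
If the full panel holds several cities per year, `idhm[year == 2000]` returns one value per city, so the lookup would need to be done within each city. A minimal sketch of that idea with dplyr follows. It assumes a `city` column identifying the panel unit and exactly one observed idhm value per city in 2000 and in 2010; the `panel` data frame and the values for city "B" are made-up placeholders, not data from the question.

```r
library(dplyr)

# Toy panel: city "A" reuses the values from the question; city "B" uses
# made-up placeholder values purely for illustration.
panel <- data.frame(
  city = rep(c("A", "B"), each = 21),
  year = rep(1995:2015, times = 2),
  idhm = NA_integer_
)
panel$idhm[panel$city == "A" & panel$year == 2000] <- 24599L
panel$idhm[panel$city == "A" & panel$year == 2010] <- 5598L
panel$idhm[panel$city == "B" & panel$year == 2000] <- 100L
panel$idhm[panel$city == "B" & panel$year == 2010] <- 200L

# Within each city, fill 1995-2004 from 2000 and 2005-2015 from 2010.
panel <- panel %>%
  group_by(city) %>%
  mutate(new_var = if_else(year <= 2004,
                           idhm[year == 2000],
                           idhm[year == 2010])) %>%
  ungroup()
```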