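This seems simple, but I can't find an answer online. I have panel data on city characteristics for 1995-2015. For some variables I only have data for 2000 and 2010. So I want to create a new variable that fills in the missing years, using the 2000 value for 1995-2004 and the 2010 value for 2005-2015.
My dataset looks like this example: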
cities idhm year
1 B NA 1995
2 C NA 1996
3 D NA 1997
4 E NA 1998
5 F NA 1999
6 G 24599 2000
7 H NA 2001
8 I NA 2002
9 J NA 2003
10 K NA 2004
11 L NA 2005
12 M NA 2006
13 N NA 2007
14 O NA 2008
15 P NA 2009
16 Q 5598 2010
17 R NA 2011
18 S NA 2012
19 T NA 2013
20 U NA 2014
21 V NA 2015
I would like a dataset like this:
cities idhm year newvar
1 B NA 1995 24599
2 C NA 1996 24599
3 D NA 1997 24599
4 E NA 1998 24599
5 F NA 1999 24599
6 G 24599 2000 24599
7 H NA 2001 24599
8 I NA 2002 24599
9 J NA 2003 24599
10 K NA 2004 24599
11 L NA 2005 5598
12 M NA 2006 5598
13 N NA 2007 5598
14 O NA 2008 5598
15 P NA 2009 5598
16 Q 5598 2010 5598
17 R NA 2011 5598
18 S NA 2012 5598
19 T NA 2013 5598
20 U NA 2014 5598
21 V NA 2015 5598
Any help is welcome.
Answer 0 (score: 2)
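I suspect your data is larger than this example, so a more general approach is a rolling join. I find this easiest with data.table.
First, build a lookup table ("dictionary") of the non-missing values to join against.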
library(data.table)
setDT(data1)
dictionary <- data1[!is.na(idhm),.(year,idhm)]
dictionary
# year idhm
#1: 2000 24599
#2: 2010 5598
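Then perform the join with on = "year" and roll = "nearest". Note that 2005 is equidistant from 2000 and 2010; in the output below it receives the 2000 value, whereas the question asks for the 2010 value in that year.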
result <- dictionary[data1,on = "year",roll="nearest"]
result[,.(cities,year,idhm)]
# cities year idhm
# 1: B 1995 24599
# 2: C 1996 24599
# 3: D 1997 24599
# 4: E 1998 24599
# 5: F 1999 24599
# 6: G 2000 24599
# 7: H 2001 24599
# 8: I 2002 24599
# 9: J 2003 24599
#10: K 2004 24599
#11: L 2005 24599
#12: M 2006 5598
#13: N 2007 5598
#14: O 2008 5598
#15: P 2009 5598
#16: Q 2010 5598
#17: R 2011 5598
#18: S 2012 5598
#19: T 2013 5598
#20: U 2014 5598
#21: V 2015 5598
# cities year idhm
Data
data1 <- structure(list(cities = structure(1:21, .Label = c("B", "C",
"D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P",
"Q", "R", "S", "T", "U", "V"), class = "factor"), idhm = c(NA,
NA, NA, NA, NA, 24599L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5598L,
NA, NA, NA, NA, NA), year = 1995:2015), class = "data.frame", row.names = c(NA,
-21L))
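Since the question mentions panel data, the same rolling join can probably be run per city by adding the city column to the join. The following is only a sketch under that assumption: the two-city `panel` table and its values are made up for illustration and do not come from the question.

```r
library(data.table)

# Hypothetical two-city panel; the values are invented for illustration
panel <- data.table(
  cities = rep(c("A", "B"), each = 21),
  year   = rep(1995:2015, times = 2),
  idhm   = NA_integer_
)
panel[cities == "A" & year == 2000, idhm := 100L]
panel[cities == "A" & year == 2010, idhm := 200L]
panel[cities == "B" & year == 2000, idhm := 300L]
panel[cities == "B" & year == 2010, idhm := 400L]

# Lookup table holding only the observed values
dictionary <- panel[!is.na(idhm), .(cities, year, idhm)]

# Exact match on cities, rolling match on year (the last join column)
result <- dictionary[panel, on = .(cities, year), roll = "nearest"]

# idhm now holds the filled value; i.idhm is the original column from panel
result[, .(cities, year, idhm)]
```

Listing `cities` before `year` in `on` matters here, because the roll is applied to the last join column only, so values are never borrowed from another city.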
Answer 1 (score: 1)
With the data in a data frame named df, we can do this in base R:
df$new_var <- NA
df$new_var[df$year >= 1995 & df$year <= 2004] <- df$idhm[df$year == 2000]
df$new_var[df$year >= 2005 & df$year <= 2015] <- df$idhm[df$year == 2010]
Or using dplyr:
library(dplyr)
df %>%
  mutate(new_var = case_when(between(year, 1995, 2004) ~ idhm[year == 2000],
                             between(year, 2005, 2015) ~ idhm[year == 2010]))
# cities idhm year new_var
#1 B NA 1995 24599
#2 C NA 1996 24599
#3 D NA 1997 24599
#4 E NA 1998 24599
#5 F NA 1999 24599
#6 G 24599 2000 24599
#7 H NA 2001 24599
#8 I NA 2002 24599
#9 J NA 2003 24599
#10 K NA 2004 24599
#11 L NA 2005 5598
#12 M NA 2006 5598
#13 N NA 2007 5598
#14 O NA 2008 5598
#15 P NA 2009 5598
#16 Q 5598 2010 5598
#17 R NA 2011 5598
#18 S NA 2012 5598
#19 T NA 2013 5598
#20 U NA 2014 5598
#21 V NA 2015 5598
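If the year ranges should not be hard-coded, one possible base R variant derives the cut points from the observed years. This is a sketch, not part of the answer above: it assumes the cut point sits halfway between consecutive observed years and that a year exactly on a cut point (2005 here) goes to the later observation, as the question requests. The names `known` and `cuts` are introduced only for this example.

```r
# Same toy data as in the question
df <- data.frame(
  cities = LETTERS[2:22],
  idhm   = NA_integer_,
  year   = 1995:2015
)
df$idhm[df$year == 2000] <- 24599L
df$idhm[df$year == 2010] <- 5598L

# Years with an observed value, in ascending order
known <- df[!is.na(df$idhm), c("year", "idhm")]
known <- known[order(known$year), ]

# Cut points halfway between consecutive observed years (here: 2005).
# A year exactly on a cut point is assigned to the later observed year.
cuts <- (head(known$year, -1) + tail(known$year, -1)) / 2

# findInterval() gives 0 below the first cut point, 1 in the next band, and so on
df$new_var <- known$idhm[findInterval(df$year, cuts) + 1]
```

With this data, 1995-2004 receive the 2000 value and 2005-2015 receive the 2010 value, matching the table requested in the question.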