Recoding a longitudinal variable in R

Time: 2016-06-15 11:20:57

Tags: r

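I have a longitudinal data frame, prueba, that follows different units (variable LA) over time (variables month and year). The first 25 observations have the following structure.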

> head(prueba, 25)
                     LA month year entry exit total homes
1  Barking and Dagenham    10 2010     2    0     2    NA
2  Barking and Dagenham    11 2010     3    0     3    NA
3  Barking and Dagenham    12 2010     3    0     3    15
4  Barking and Dagenham     1 2011     6    0     6    NA
5  Barking and Dagenham     2 2011     1    0     1    NA
6  Barking and Dagenham     3 2011     2    0     2    NA
7  Barking and Dagenham     4 2011     1    0     1    NA
8  Barking and Dagenham    10 2011     1    0     1    NA
9  Barking and Dagenham    11 2011     1    0     1    NA
10 Barking and Dagenham     1 2012     1    0     1    NA
11 Barking and Dagenham     9 2012     1    0     1    NA
12 Barking and Dagenham     6 2013     2    0     2    NA
13 Barking and Dagenham     1 2014     0    1    -1    NA
14 Barking and Dagenham    12 2014     0    1    -1    NA
15 Barking and Dagenham     3 2015     1    1     0    NA
16 Barking and Dagenham    11 2015     1    1     0    NA
17 Barking and Dagenham    12 2015     1    0     1    NA
18               Barnet    11 2010    24    0    24    NA
19               Barnet    12 2010    28    0    28    86
20               Barnet     1 2011    28    0    28    NA
21               Barnet     2 2011     6    0     6    NA
22               Barnet     3 2011     1    0     1    NA
23               Barnet     4 2011     1    0     1    NA
24               Barnet     7 2011     2    0     2    NA
25               Barnet     8 2011     1    0     1    NA

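My goal is to recode the variable homes by assigning its non-missing value to the observation with month == "2" and year == "2011". If a unit has no observation for those values of month and year, the recoded observation should be the one corresponding to month == "1" and year == "2011". Ideally, the expected output would look like this: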

> head(prueba, 25)
                     LA month year entry exit total homes
1  Barking and Dagenham    10 2010     2    0     2    NA
2  Barking and Dagenham    11 2010     3    0     3    NA
3  Barking and Dagenham    12 2010     3    0     3    NA
4  Barking and Dagenham     1 2011     6    0     6    NA
5  Barking and Dagenham     2 2011     1    0     1    15
6  Barking and Dagenham     3 2011     2    0     2    NA
7  Barking and Dagenham     4 2011     1    0     1    NA
8  Barking and Dagenham    10 2011     1    0     1    NA
9  Barking and Dagenham    11 2011     1    0     1    NA
10 Barking and Dagenham     1 2012     1    0     1    NA
11 Barking and Dagenham     9 2012     1    0     1    NA
12 Barking and Dagenham     6 2013     2    0     2    NA
13 Barking and Dagenham     1 2014     0    1    -1    NA
14 Barking and Dagenham    12 2014     0    1    -1    NA
15 Barking and Dagenham     3 2015     1    1     0    NA
16 Barking and Dagenham    11 2015     1    1     0    NA
17 Barking and Dagenham    12 2015     1    0     1    NA
18               Barnet    11 2010    24    0    24    NA
19               Barnet    12 2010    28    0    28    NA
20               Barnet     1 2011    28    0    28    NA
21               Barnet     2 2011     6    0     6    86
22               Barnet     3 2011     1    0     1    NA
23               Barnet     4 2011     1    0     1    NA
24               Barnet     7 2011     2    0     2    NA
25               Barnet     8 2011     1    0     1    NA

I tried to solve this with data.table as follows:

test = data.table(prueba)
setkey(test, LA)
test$homes =test[, .SD[, ifelse(year == "2011" & month == "2", !is.na(homes), homes)], by=LA]

But it does not produce the expected output.

> head(test, 25)
                      LA month year entry exit total homes
 1: Barking and Dagenham    10 2010     2    0     2    NA
 2: Barking and Dagenham    11 2010     3    0     3    NA
 3: Barking and Dagenham    12 2010     3    0     3    15
 4: Barking and Dagenham     1 2011     6    0     6    NA
 5: Barking and Dagenham     2 2011     1    0     1    NA
 6: Barking and Dagenham     3 2011     2    0     2    NA
 7: Barking and Dagenham     4 2011     1    0     1    NA
 8: Barking and Dagenham    10 2011     1    0     1    NA
 9: Barking and Dagenham    11 2011     1    0     1    NA
10: Barking and Dagenham     1 2012     1    0     1    NA
11: Barking and Dagenham     9 2012     1    0     1    NA
12: Barking and Dagenham     6 2013     2    0     2    NA
13: Barking and Dagenham     1 2014     0    1    -1    NA
14: Barking and Dagenham    12 2014     0    1    -1    NA
15: Barking and Dagenham     3 2015     1    1     0    NA
16: Barking and Dagenham    11 2015     1    1     0    NA
17: Barking and Dagenham    12 2015     1    0     1    NA
18:               Barnet    11 2010    24    0    24    NA
19:               Barnet    12 2010    28    0    28    86
20:               Barnet     1 2011    28    0    28    NA
21:               Barnet     2 2011     6    0     6    NA
22:               Barnet     3 2011     1    0     1    NA
23:               Barnet     4 2011     1    0     1    NA
24:               Barnet     7 2011     2    0     2    NA
25:               Barnet     8 2011     1    0     1    NA
                      LA month year entry exit total homes

I would be grateful if someone could suggest an alternative approach, not necessarily with data.table.

1 Answer:

Answer 0: (score: 1)

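library(dplyr)
# df stands for the data frame from the question (called prueba there)
dfs <- data.frame(df %>% 
                  group_by(LA) %>% 
                  # one row per LA: the sum of its non-missing homes values
                  summarise(Homes = sum(homes, na.rm = TRUE)) %>%
                  # attach that per-LA value back onto every original row
                  inner_join(., df, by = 'LA') %>% 
                  # keep the value only on the February 2011 row
                  mutate(Homes = ifelse(month == 2 & year == 2011, Homes, NA)))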

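This should do the trick. Because dplyr works on whole columns at once, it is fast compared with doing the same thing iteratively (e.g. with for or while loops).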
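The question also asks for a fallback to January 2011 when a unit has no February 2011 row. A minimal sketch of one way to cover that case with dplyr follows. It is not part of the original answer, and it rests on a few assumptions: month and year are numeric, each LA has at most one non-missing homes value (the first one is used otherwise), and the helper columns value, is_feb, is_jan and target are names made up for this illustration. The toy data reuses a few rows from the question, with Barnet's February 2011 row left out on purpose to exercise the fallback.

```r
library(dplyr)

# Toy data: a few rows taken from the question's prueba.
# Barnet's February 2011 row is deliberately omitted to trigger the fallback.
prueba <- data.frame(
  LA    = c(rep("Barking and Dagenham", 4), rep("Barnet", 3)),
  month = c(12, 1, 2, 3, 12, 1, 3),
  year  = c(2010, 2011, 2011, 2011, 2010, 2011, 2011),
  homes = c(15, NA, NA, NA, 86, NA, NA)
)

recoded <- prueba %>%
  group_by(LA) %>%
  mutate(
    # first non-missing homes value of this LA (NA if there is none)
    value  = homes[!is.na(homes)][1],
    is_feb = month == 2 & year == 2011,
    is_jan = month == 1 & year == 2011,
    # target row: February 2011 if the LA has one, otherwise January 2011
    target = if (any(is_feb)) is_feb else is_jan,
    # put the value on the target row and NA everywhere else
    homes  = ifelse(target, value, NA_real_)
  ) %>%
  ungroup() %>%
  select(-value, -is_feb, -is_jan, -target)

print(recoded)
```

With this toy data the value 15 ends up on the February 2011 row for Barking and Dagenham, and 86 ends up on the January 2011 row for Barnet. Unlike the answer above, this overwrites homes in place rather than adding a second Homes column, and an LA with neither a February nor a January 2011 row is left with homes entirely NA.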