Fill missing values based on a relationship with another table

Time: 2019-08-11 00:15:53

Tags: r data.table

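I have two data tables, city_pop and city_sub. city_pop is a list of cities with their average population, with some values missing. The city_sub table provides two possible substitutes (sub_1 and sub_2) for each city_id, whose avg_pop can be used to fill the NAs in city_pop. sub_1 and sub_2 are to be used in that order of preference. Only the avg_pop values that are NA need to be replaced.

How can I do this without using a for loop?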

library(data.table)

city_id = c(1, 2, 3, 4, 5, 6)
avg_pop = c(100, NA, NA, 300, 400, NA)

city_pop = data.table(city_id, avg_pop)

   city_id avg_pop
1:       1     100
2:       2      NA
3:       3      NA
4:       4     300
5:       5     400
6:       6      NA

sub_1=c(2,1,4,3,1,3)
sub_2=c(5,5,6,6,2,4)

city_sub =data.table(city_id,sub_1,sub_2)

   city_id sub_1 sub_2
1:       1     2     5
2:       2     1     5
3:       3     4     6
4:       4     3     6
5:       5     1     2
6:       6     3     4

Expected output:

  city_id avg_pop
1       1     100
2       2     100
3       3     300
4       4     300
5       5     400
6       6     300

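3 answers:

Answer 0 (score: 3)

Here is one way to do it with dplyr, using coalesce, which returns the first non-NA value among its arguments. I created a separate column, avg_pop2, because that seems safer in this case and makes the result easy to verify.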

library(dplyr)

city_pop %>% 
  left_join(city_sub, by = "city_id") %>% 
  mutate(
    avg_pop2 = coalesce(
      avg_pop, avg_pop[match(sub_1, city_id)], avg_pop[match(sub_2, city_id)]
    )
  )

  city_id avg_pop sub_1 sub_2 avg_pop2
1       1     100     2     5      100
2       2      NA     1     5      100
3       3      NA     4     6      300
4       4     300     3     6      300
5       5     400     1     2      400
6       6      NA     3     4      300
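
Since the question is tagged data.table, the same coalesce idea can be sketched without dplyr. This is only an illustration, not part of the original answer: it assumes data.table 1.12.4 or later (for fcoalesce) and that city_id is unique in both tables. The names idx, pop_sub_1 and pop_sub_2 are made up for the example.

```r
library(data.table)

city_pop <- data.table(city_id = 1:6, avg_pop = c(100, NA, NA, 300, 400, NA))
city_sub <- data.table(city_id = 1:6,
                       sub_1 = c(2, 1, 4, 3, 1, 3),
                       sub_2 = c(5, 5, 6, 6, 2, 4))

# Align city_sub rows with city_pop rows (assumes city_id is unique in both).
idx <- match(city_pop$city_id, city_sub$city_id)

# Look up the population of each substitute city by its id.
pop_sub_1 <- city_pop$avg_pop[match(city_sub$sub_1[idx], city_pop$city_id)]
pop_sub_2 <- city_pop$avg_pop[match(city_sub$sub_2[idx], city_pop$city_id)]

# fcoalesce() returns the first non-NA value per row, in argument order,
# so avg_pop wins over sub_1, and sub_1 wins over sub_2.
city_pop[, avg_pop2 := fcoalesce(avg_pop, pop_sub_1, pop_sub_2)]
```

As in the dplyr version, the lookups read the original avg_pop, so a substitute that is itself missing is skipped rather than filled first.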

Answer 1 (score: 1)

One way is to look up sub_1 and then look up its avg_pop, and then do the same for sub_2:

city_pop[is.na(avg_pop), avg_pop :=  
  city_pop[.(city_sub[.SD, on=.(city_id), x.sub_1]), on=.(city_id), x.avg_pop]
]
city_pop[is.na(avg_pop), avg_pop := 
  city_pop[.(city_sub[.SD, on=.(city_id), x.sub_2]), on=.(city_id), x.avg_pop]
]

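This approach is somewhat convoluted and does not extend to more general examples. A graph-theory approach might make more sense, for example if city_sub looked like this: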

   city_id sub_1 
1:       1     5 
5:       5     3 

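Suppose cities 1 and 5 are both missing data. We would want 5 to be filled from 3, and then 1 to be filled from 5, but that requires knowing the order in which to fill. I imagine that with a directed graph you could work out the correct traversal order, though I have not thought through all the details.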
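
A simpler alternative to building a graph, sketched here under stated assumptions, is to repeat the fill pass until nothing changes, which resolves chains like 1 -> 5 -> 3 without computing an order up front. This is not the directed-graph approach itself. The population value 200 for city 3 and the helper fill_pass are hypothetical, and the sketch assumes data.table 1.12.4 or later for fcoalesce.

```r
library(data.table)

# Hypothetical chained data: city 1 falls back to 5, and 5 falls back to 3.
city_pop <- data.table(city_id = c(1, 3, 5), avg_pop = c(NA, 200, NA))
city_sub <- data.table(city_id = c(1, 5), sub_1 = c(5, 3))

# fill_pass() is a hypothetical helper: one vectorised pass that fills
# NAs in avg_pop from the current avg_pop of each city's sub_1.
fill_pass <- function(pop, sub) {
  sub_id  <- sub$sub_1[match(pop$city_id, sub$city_id)]
  sub_pop <- pop$avg_pop[match(sub_id, pop$city_id)]
  fcoalesce(pop$avg_pop, sub_pop)
}

# Repeat until a pass changes nothing; this loops over passes, not rows.
repeat {
  filled <- fill_pass(city_pop, city_sub)
  if (identical(filled, city_pop$avg_pop)) break
  city_pop[, avg_pop := filled]
}
```

Each pass can only turn NAs into values, so the loop stops after at most one pass per missing city, and cycles of cities that are all missing simply stay NA.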

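Answer 2 (score: 1)

Another possible approach is to convert city_sub to long format and tweak city_id by adding a decimal place that encodes the order of preference, before using a rolling join: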

# convert city_sub into long format
newpop <- melt(city_sub, measure.vars=patterns("^sub_"), variable.factor=FALSE)[,
    # tweak the city_id slightly to encode the order of preference (e.g. 2.1, 2.2)
    city_id := as.numeric(paste0(city_id, ".", substring(variable, nchar(variable))))][
    # look up the average population of each substitute
    city_pop, on=.(value=city_id), new_pop := i.avg_pop][
    # remove substitutes without a population
    !is.na(new_pop)]
newpop
#   city_id variable value new_pop
#1:     2.1    sub_1     1     100
#2:     3.1    sub_1     4     300
#3:     5.1    sub_1     1     100
#4:     1.2    sub_2     5     400
#5:     2.2    sub_2     5     400
#6:     6.2    sub_2     4     300

#rolling join
city_pop[is.na(avg_pop), avg_pop :=
        newpop[copy(.SD), on=.(city_id), roll=-Inf, x.new_pop]]

Output:

   city_id avg_pop
1:       1     100
2:       2     100
3:       3     300
4:       4     300
5:       5     400
6:       6     300
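
To make the role of roll=-Inf concrete, here is a small sketch on made-up tables (lookup and query are hypothetical names, and the numbers mirror a subset of newpop). With roll=-Inf, data.table carries the next observation backward, so each queried id matches the smallest key that is greater than or equal to it.

```r
library(data.table)

# Hypothetical lookup table: keys 2.1 and 2.2 encode "city 2, preference 1 and 2".
lookup <- data.table(city_id = c(2.1, 2.2, 6.2), new_pop = c(100, 400, 300))
query  <- data.table(city_id = c(2, 3, 6))

# roll = -Inf: no exact match exists, so each id rolls to the next larger key.
# 2 -> 2.1, 6 -> 6.2, and 3 (which has no key of its own) -> 6.2.
res <- lookup[query, on = .(city_id), roll = -Inf, x.new_pop]
```

The third query row shows a point to watch: an id with no usable substitute of its own still rolls forward to the next city's key, so this technique seems safest when every missing city has at least one substitute with a known population, as in the example data.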

Data:

library(data.table)
city_pop = data.table(city_id=1:6, avg_pop=c(100, NA, NA, 300, 400, NA))
city_sub = data.table(city_id=1:6, sub_1=c(2,1,4,3,1,3), sub_2=c(5,5,6,6,2,4))