Fill in missing values using other data?

Time: 2017-08-29 11:34:24

Tags: r missing-data

A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),  
                Item_B = c(NA, NA, NA, NA, "JAMES RIVER", NA, NA))

B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"), 
                Item_B = c("JAMES RIVER", NA, "JAMES RIVER",
                           "RICE MIDSTREAM", "RICE MIDSTREAM"))

Expected:

A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),  
                Item_B = c("JAMES RIVER", "JAMES RIVER", "JAMES RIVER", 
                         "JAMES RIVER", "JAMES RIVER", "RICE MIDSTREAM", "RICE MIDSTREAM"))

B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"), 
                Item_B = c("JAMES RIVER", "JAMES RIVER", "JAMES RIVER", 
                           "RICE MIDSTREAM", "RICE MIDSTREAM"))

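I have to fill in Item_B based on the Item_B of other rows that have the same Item_A. For example, the first through fourth observations of Item_B in dataset A need to become "JAMES RIVER".

Could you suggest a way to fill in these missing values in R? I have tried many tricks but could not get what I want.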

4 Answers:

Answer 0: (score: 3)

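As I understand it, the task is to fill the missing values in one column of each data.frame. I think this requires filling in the Item_B value that belongs to each Item_A with the help of a lookup (mapping) table:
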
library(data.table)
# create mapping table from both data.frames
map <- unique(rbindlist(list(A, B)))[!is.na(Item_B)]
# or, in case there are additional columns besides Item_A and Item_B
map <- unique(rbindlist(list(A, B))[!is.na(Item_B), .(Item_A, Item_B)])
map
   Item_A         Item_B
1:   00FF    JAMES RIVER
2:   00EF    JAMES RIVER
3:   00FR RICE MIDSTREAM
# join and replace
setDT(A)[map, on = c("Item_A"), Item_B := i.Item_B][]
   Item_A         Item_B
1:   00FF    JAMES RIVER
2:   00FF    JAMES RIVER
3:   00FF    JAMES RIVER
4:   00FF    JAMES RIVER
5:   00FF    JAMES RIVER
6:   00FR RICE MIDSTREAM
7:   00FR RICE MIDSTREAM
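setDT(B)[map, on = c("Item_A"), Item_B := i.Item_B][]
   Item_A         Item_B
1:   00EF    JAMES RIVER
2:   00EF    JAMES RIVER
3:   00EF    JAMES RIVER
4:   00FR RICE MIDSTREAM
5:   00FR RICE MIDSTREAM

During the join there are two columns named Item_B: one from the first data.table, A (or B, respectively), and one from the second data.table, map. To tell them apart, the i. prefix indicates that Item_B should be taken from map.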
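For readers without data.table, the same lookup idea can be sketched in base R with match(). This is only an illustration under a few assumptions: the data is the example from the question, each Item_A maps to a single Item_B, and the names map_df and fill_from_map are made up here. Unlike the update join above, which assigns Item_B on every matched row, this sketch only touches rows where Item_B is missing.

```r
# Example data from the question (strings kept as character, not factor)
A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),
                Item_B = c(NA, NA, NA, NA, "JAMES RIVER", NA, NA),
                stringsAsFactors = FALSE)
B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"),
                Item_B = c("JAMES RIVER", NA, "JAMES RIVER",
                           "RICE MIDSTREAM", "RICE MIDSTREAM"),
                stringsAsFactors = FALSE)

# Mapping table: distinct, non-missing (Item_A, Item_B) pairs from both tables
both <- rbind(A, B)
map_df <- unique(both[!is.na(both$Item_B), ])

# Hypothetical helper: look up Item_B by Item_A, only where Item_B is missing
fill_from_map <- function(df, map_df) {
  idx <- match(df$Item_A, map_df$Item_A)   # position of each id in the map
  missing <- is.na(df$Item_B)
  df$Item_B[missing] <- map_df$Item_B[idx[missing]]
  df
}

A_filled <- fill_from_map(A, map_df)
B_filled <- fill_from_map(B, map_df)
```

An Item_A that never appears with a known Item_B simply stays NA, because match() returns NA for it.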

Answer 1: (score: 2)

You could try creating a dictionary data frame.

library(dplyr)
# build a dictionary of the known (Item_A, Item_B) pairs from both data frames
dictionary <- bind_rows(A, B) %>%
  filter(!is.na(Item_B)) %>%
  distinct()
# return the Item_B value(s) recorded for a given Item_A id
find_name <- function(id) {
  dictionary[["Item_B"]][which(dictionary[["Item_A"]] == id)]
}
test_id <- c("00EF", "00EF", "00EF", "00FR", "00FR")
new_names <- sapply(test_id, find_name)

Then you can declare your data frames:

New_A <- data.frame(Item_A=c("00FF","00FF","00FF","00FF","00FF","00FR","00FR"),
                Item_B=sapply(c("00FF","00FF","00FF","00FF","00FF","00FR","00FR"),find_name))

New_B <- data.frame(Item_A=c("00EF","00EF","00EF","00FR","00FR"), 
                Item_B=sapply(c("00EF","00EF","00EF","00FR","00FR"),find_name))
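One thing to watch with find_name: for an id that is not in the dictionary it returns a zero-length vector, so sapply can no longer simplify the result to a character vector and hands back a list instead. A possible alternative, sketched here under the assumption that each Item_A has exactly one Item_B, is a named character vector used as the lookup. The small dictionary below and the id "ZZZZ" are made up for illustration.

```r
# Hypothetical dictionary with one Item_B per Item_A (values from the question)
dictionary <- data.frame(Item_A = c("00EF", "00FR"),
                         Item_B = c("JAMES RIVER", "RICE MIDSTREAM"),
                         stringsAsFactors = FALSE)

# Named character vector: names are the ids, values are the names to look up
lookup <- setNames(dictionary$Item_B, dictionary$Item_A)

# "ZZZZ" is a made-up id that is not in the dictionary
test_id <- c("00EF", "00FR", "ZZZZ")

# Indexing by name is vectorised; unknown ids come back as NA
new_names <- unname(lookup[test_id])
```

The result has the same length as test_id, so it can go straight into a data.frame column.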

Answer 2: (score: 1)

You could try the fill helper from the tidyr library.

library(tidyr)
A %>% 
  tidyr::fill(Item_B, .direction = "down") %>% 
  tidyr::fill(Item_B, .direction = "up")

  Item_A      Item_B
1   00FF JAMES RIVER
2   00FF JAMES RIVER
3   00FF JAMES RIVER
4   00FF JAMES RIVER
5   00FF JAMES RIVER
6   00FR JAMES RIVER
7   00FR JAMES RIVER
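In the output above, rows 6 and 7 (Item_A 00FR) end up as JAMES RIVER, whereas the question expects RICE MIDSTREAM: the fill carries values down and up regardless of Item_A. A group-aware variant might look like the base R sketch below. It assumes the example data from the question and pools A and B first, because A alone has no known Item_B for 00FR. The src column and the fill_group helper are names made up for this sketch.

```r
A <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00EF", "00EF", "00FR", "00FR"),
                Item_B = c(NA, NA, NA, NA, "JAMES RIVER", NA, NA),
                stringsAsFactors = FALSE)
B <- data.frame(Item_A = c("00EF", "00EF", "00EF", "00FR", "00FR"),
                Item_B = c("JAMES RIVER", NA, "JAMES RIVER",
                           "RICE MIDSTREAM", "RICE MIDSTREAM"),
                stringsAsFactors = FALSE)

# Pool both tables and remember where each row came from
A$src <- "A"
B$src <- "B"
both <- rbind(A, B)

# Within one group, replace NA with the group's first non-missing value
fill_group <- function(v) {
  known <- v[!is.na(v)]
  if (length(known) > 0) v[is.na(v)] <- known[1]
  v
}

# ave() applies fill_group separately to each Item_A group
both$Item_B <- ave(both$Item_B, both$Item_A, FUN = fill_group)

# Split back into the two original tables
A_filled <- both[both$src == "A", c("Item_A", "Item_B")]
B_filled <- both[both$src == "B", c("Item_A", "Item_B")]
```

With tidyr, grouping by Item_A before calling fill is the analogous idea, applied to the pooled data.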

Answer 3: (score: 0)

@YXCHEN updated based on your input

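library(data.table)  # rbindlist
library(dplyr)       # left_join, select, %>%
lookup_df <- unique(rbindlist(list(A, B)))[!is.na(Item_B)]
# keep only the key column, then pull Item_B back in from the lookup table
left_join(A %>% select(Item_A), lookup_df, by = "Item_A")
left_join(B %>% select(Item_A), lookup_df, by = "Item_A")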
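A join against a lookup table keeps the original row count only if each Item_A appears once in the lookup. If an id maps to two different names, every matching row is duplicated. The base R sketch below shows one way to check for that before joining. The lookup table here is made up for illustration, and "JAMES CREEK" is not a value from the question.

```r
# Hypothetical lookup table where "00EF" maps to two different names
lookup_df <- data.frame(Item_A = c("00EF", "00EF", "00FR"),
                        Item_B = c("JAMES RIVER", "JAMES CREEK", "RICE MIDSTREAM"),
                        stringsAsFactors = FALSE)

# Ids that occur more than once would multiply rows in a join
ambiguous <- unique(lookup_df$Item_A[duplicated(lookup_df$Item_A)])

# A single "00EF" row joined against this lookup comes back as two rows
one_row <- data.frame(Item_A = "00EF", stringsAsFactors = FALSE)
joined <- merge(one_row, lookup_df, by = "Item_A", all.x = TRUE)
```

If ambiguous is non-empty, the lookup needs to be resolved (for example by picking one name per id) before it is used to fill values.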