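Using ID

Time: 2018-01-03 18:02:50

Tags: r data-manipulation recode

I have two data frames, each containing the same variables and a unique ID for every observation.

df.1 is a large dataset with missing values, represented by NA. The values for these missing entries are held in df.2, and I want to replace the missing values in df.1 with the values from df.2 by matching on id.

I haven't found a similar question here that covers the case where all the variables are factors.

To keep it simple: if the ids match, the missing value in df.1 should be replaced with the factor value from df.2.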

df.1 <- data.frame(id = c(334,440,501,2304,2500), 
                v1 = c("4 dogs",NA,"3 dogs",NA,"No dogs"))

df.2 <- data.frame(id = c(440,2304), 
                v2 = c("4 dogs","5 dogs"))

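Thank you very much for your help.

4 answers:

Answer 0: (score: 2)

As @Gregor mentioned, you can convert the factor columns to character and then convert the result back to factor. The handy function here is coalesce, as pointed out by @MrFlick. The solution is self-explanatory: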

library(dplyr)

df.1 %>%
  left_join(df.2, by = "id") %>%
  mutate_if(is.factor, as.character) %>%
  mutate(final = coalesce(v1, v2)) %>%
  mutate_if(is.character, as.factor)

Output

   id      v1     v2   final
1  334  4 dogs   <NA>  4 dogs
2  440    <NA> 4 dogs  4 dogs
3  501  3 dogs   <NA>  3 dogs
4 2304    <NA> 5 dogs  5 dogs
5 2500 No dogs   <NA> No dogs

Store the result above in a variable (df), then check str(df):

'data.frame':   5 obs. of  4 variables:
 $ id   : num  334 440 501 2304 2500
 $ v1   : Factor w/ 3 levels "3 dogs","4 dogs",..: 2 NA 1 NA 3
 $ v2   : Factor w/ 2 levels "4 dogs","5 dogs": NA 1 NA 2 NA
 $ final: Factor w/ 4 levels "3 dogs","4 dogs",..: 2 2 1 3 4

If you want to drop the v1 and v2 columns, just pipe the final result into %>% select(id, final).
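A minimal sketch of the whole pipeline with the helper columns dropped. It assumes the columns are factors, as in the question; this is made explicit with stringsAsFactors = TRUE because R 4.0 and later create character columns by default. The name result is only illustrative.

```r
library(dplyr)

# Example data from the question, with factor columns made explicit
df.1 <- data.frame(id = c(334, 440, 501, 2304, 2500),
                   v1 = c("4 dogs", NA, "3 dogs", NA, "No dogs"),
                   stringsAsFactors = TRUE)
df.2 <- data.frame(id = c(440, 2304),
                   v2 = c("4 dogs", "5 dogs"),
                   stringsAsFactors = TRUE)

result <- df.1 %>%
  left_join(df.2, by = "id") %>%
  # coalesce on character vectors, then rebuild a single factor
  mutate(final = factor(coalesce(as.character(v1), as.character(v2)))) %>%
  # keep only the id and the filled-in column
  select(id, final)
```

Converting to character before coalesce sidesteps the fact that v1 and v2 carry different sets of levels.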

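Hope it works.

Answer 1: (score: 0)

You can join df.1 and df.2 so that both v1 and v2 are kept in the merged data.frame, then run the logic that replaces missing v1 with the value of v2.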

library(dplyr)

df.1 <- data.frame(id = c(334,440,501,2304,2500), 
                   v1 = c("4 dogs",NA,"3 dogs",NA,"No dogs"))

df.2 <- data.frame(id = c(440,2304), 
                   v2 = c("4 dogs","5 dogs"))
#merge using left_join to keep all rows from df.1
final <- df.1 %>%
  left_join(df.2, by = "id")
#> final
#    id      v1     v2
#1  334  4 dogs   <NA>
#2  440    <NA> 4 dogs
#3  501  3 dogs   <NA>
#4 2304    <NA> 5 dogs
#5 2500 No dogs   <NA>

#Define a function to replace missing v1
replMissing <- function(x, y){
  ifelse(is.na(x), y, x )
}

#call replMissing function using mapply. Modified to handle factor
final$v1 <- as.factor(mapply(replMissing, as.character(final$v1), as.character(final$v2)))

#results is
#> final
#    id      v1     v2
#1  334  4 dogs   <NA>
#2  440  4 dogs 4 dogs
#3  501  3 dogs   <NA>
#4 2304  5 dogs 5 dogs
#5 2500 No dogs   <NA>

v2 can now be dropped.
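Since ifelse is already vectorized, the same replacement can probably be done without mapply. The following is a sketch rather than the answer's own code: it swaps left_join for base R's merge with all.x = TRUE so it runs without extra packages, and v1_chr and v2_chr are illustrative names.

```r
# Example data from the question, with factor columns made explicit
df.1 <- data.frame(id = c(334, 440, 501, 2304, 2500),
                   v1 = c("4 dogs", NA, "3 dogs", NA, "No dogs"),
                   stringsAsFactors = TRUE)
df.2 <- data.frame(id = c(440, 2304),
                   v2 = c("4 dogs", "5 dogs"),
                   stringsAsFactors = TRUE)

# Base R left join: all.x = TRUE keeps every row of df.1
final <- merge(df.1, df.2, by = "id", all.x = TRUE)

# Work on character copies so differing factor levels do not interfere
v1_chr <- as.character(final$v1)
v2_chr <- as.character(final$v2)

# Fill missing v1 from v2 and convert back to factor
final$v1 <- factor(ifelse(is.na(v1_chr), v2_chr, v1_chr))

# Drop the helper column
final$v2 <- NULL
```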

Answer 2: (score: 0)

Using data.table and dplyr:

library(data.table)
library(dplyr)
df <- left_join(df.1, df.2, by = "id")
setDT(df)
df[is.na(v1), v1 := v2]
df[, v2 := NULL]

You will get the desired output:

     id      v1
1:  334  4 dogs
2:  440  4 dogs
3:  501  3 dogs
4: 2304  5 dogs
5: 2500 No dogs

Up to this point id is numeric and v1 is a factor. If you want id converted to a factor as well, you can do so with:

df[, id := as.factor(id)]
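For comparison, here is a sketch of a data.table-only variant that uses an update join instead of dplyr's left_join. It assumes character columns (the data.table default) and a data.table version that provides fifelse; dt.1 and dt.2 are illustrative names.

```r
library(data.table)

dt.1 <- data.table(id = c(334, 440, 501, 2304, 2500),
                   v1 = c("4 dogs", NA, "3 dogs", NA, "No dogs"))
dt.2 <- data.table(id = c(440, 2304),
                   v2 = c("4 dogs", "5 dogs"))

# Update join: for rows of dt.1 whose id appears in dt.2,
# keep v1 where it is present and take v2 (referenced as i.v2) where it is NA
dt.1[dt.2, on = "id", v1 := fifelse(is.na(v1), i.v2, v1)]
```

The update happens by reference, so dt.1 is modified in place and no v2 column is ever added to it.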

Answer 3: (score: 0)

With a tidyverse approach, you have two solutions:

First solution:

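library(dplyr)
library(tidyr)  # drop_na() comes from tidyr, not dplyr

df.1 <- data.frame(id = c(334,440,501,2304,2500), 
                   v1 = c("4 dogs",NA,"3 dogs",NA,"No dogs"),stringsAsFactors=FALSE) 

# Rename v2 to v1 so that both data frames share the same column
df.2 <- data.frame(id = c(440,2304), 
                   v2 = c("4 dogs","5 dogs"),stringsAsFactors=FALSE) %>% 
    rename(v1=v2)

# Stack both data frames, then drop the rows where v1 is still missing
df_mix <- bind_rows(df.1,df.2) %>% 
    drop_na(v1)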
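Note that stacking the rows and dropping the NAs leaves the filled-in rows at the bottom rather than in their original positions. A minimal sketch that restores the id order with arrange, assuming every id in df.2 corresponds to a missing v1 in df.1 (otherwise that id would appear twice):

```r
library(dplyr)
library(tidyr)

df.1 <- data.frame(id = c(334, 440, 501, 2304, 2500),
                   v1 = c("4 dogs", NA, "3 dogs", NA, "No dogs"),
                   stringsAsFactors = FALSE)
# Same data as df.2 in the question, with the column already named v1
df.2 <- data.frame(id = c(440, 2304),
                   v1 = c("4 dogs", "5 dogs"),
                   stringsAsFactors = FALSE)

df_mix <- bind_rows(df.1, df.2) %>%
  drop_na(v1) %>%   # remove the rows of df.1 that had no value
  arrange(id)       # put the rows back in id order
```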

Second solution:

library(dplyr)

df.1 <- data.frame(id = c(334,440,501,2304,2500), 
                   v1 = c("4 dogs",NA,"3 dogs",NA,"No dogs"),stringsAsFactors=FALSE)

df.2 <- data.frame(id = c(440,2304), 
                   v2 = c("4 dogs","5 dogs"),stringsAsFactors=FALSE) 

# Join on id, fill missing v1 from v2, then keep only id and v1
df_mix <- left_join(df.1,df.2,by="id") %>% 
    mutate(v1=case_when(
        is.na(v1) ~ v2,
        !is.na(v1) ~ v1
    )) %>% 
    select(id, v1)

PS: I prefer to keep strings as character vectors.