Question

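How do I merge two data tables (or data frames) in R while keeping the non-NA values from each matching column? The question Merge data frames and overwrite values provides a solution if each individual column is specified explicitly (as far as I can tell, at least). However, I have over 40 common columns between the two data tables, and which of the two holds an NA versus a valid value is somewhat random. Writing ifelse statements for 40 columns therefore seems inefficient.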

Below is a simple example, where I would like to join (merge) the two data.tables by the id and date columns:

library(data.table)  # needed for the unqualified setkey() calls

dt_1 <- data.table::data.table(id = "abc",
                               date = "2018-01-01",
                               a = 3,
                               b = NA_real_,
                               c = 4,
                               d = 6,
                               e = NA_real_)
setkey(dt_1, id, date)

> dt_1
    id       date a  b c d  e
1: abc 2018-01-01 3 NA 4 6 NA

dt_2 <- data.table::data.table(id = "abc", 
                               date = "2018-01-01",
                               a = 3, 
                               b = 5,
                               c = NA_real_, 
                               d = 6,
                               e = NA_real_)
setkey(dt_2, id, date)
> dt_2
    id       date a b  c d  e
1: abc 2018-01-01 3 5 NA 6 NA

Here is my desired output:

> dt_out
    id       date a b c d  e
1: abc 2018-01-01 3 5 4 6 NA

I have also tried the dplyr::anti_join solution from left_join two data frames and overwrite, without success.

Answer 1

I would probably put the data in long format and drop the duplicates:

k = key(dt_1)
DTList = list(dt_1, dt_2)

DTLong = rbindlist(lapply(DTList, function(x) melt(x, id=k)))    
setorder(DTLong, na.last = TRUE)    
unique(DTLong, by=c(k, "variable"))

    id       date variable value
1: abc 2018-01-01        a     3
2: abc 2018-01-01        b     5
3: abc 2018-01-01        c     4
4: abc 2018-01-01        d     6
5: abc 2018-01-01        e    NA
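
If the wide layout from the question is needed, the long result can be cast back. The following is a minimal sketch rather than part of the original answer: it rebuilds the question's tables so that it runs on its own, and the names `res_long` and `dt_out` are only illustrative.

```r
library(data.table)

dt_1 <- data.table(id = "abc", date = "2018-01-01",
                   a = 3, b = NA_real_, c = 4, d = 6, e = NA_real_)
dt_2 <- data.table(id = "abc", date = "2018-01-01",
                   a = 3, b = 5, c = NA_real_, d = 6, e = NA_real_)

k <- c("id", "date")  # join columns, as in the question

# Stack both tables in long format: one row per (id, date, variable)
DTLong <- rbindlist(lapply(list(dt_1, dt_2), melt, id.vars = k))

# Sort so that, within each (id, date, variable), non-NA values come first
setorderv(DTLong, c(k, "variable", "value"), na.last = TRUE)

# Keep the first row per group, i.e. a non-NA value whenever one exists
res_long <- unique(DTLong, by = c(k, "variable"))

# Cast back to one row per (id, date)
dt_out <- dcast(res_long, id + date ~ variable, value.var = "value")
dt_out
```

One caveat of this approach: because rows are sorted by value, when both tables hold different non-NA values for the same cell, the smaller one is kept, not necessarily the one from the first table.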

Answer 2

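You can do this with dplyr::coalesce, which returns the first non-missing value across a set of vectors.

(EDIT: you can also use dplyr::coalesce directly on the data frames, with no need to create the function below. I have left it here for completeness, as a record of the original answer.)

Credit where it is due: the code is mostly from this blog post. It builds a function that takes two data frames and does what you need (taking values from the x data frame when they are present).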

coalesce_join <- function(x, 
                          y, 
                          by, 
                          suffix = c(".x", ".y"), 
                          join = dplyr::full_join, ...) {
    joined <- join(x, y, by = by, suffix = suffix, ...)
    # names of desired output
    cols <- union(names(x), names(y))

    to_coalesce <- names(joined)[!names(joined) %in% cols]
    suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
    # remove suffixes and deduplicate
    to_coalesce <- unique(substr(
        to_coalesce, 
        1, 
        nchar(to_coalesce) - nchar(suffix_used)
    ))

    coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
        joined[[paste0(.x, suffix[1])]], 
        joined[[paste0(.x, suffix[2])]]
    ))
    names(coalesced) <- to_coalesce

    dplyr::bind_cols(joined, coalesced)[cols]
}
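
The core of the function above is one dplyr::coalesce call per pair of suffixed columns. Here is a stripped-down sketch of that step on the question's data, using plain data frames instead of data.tables. The names `df_1`, `df_2`, `keys`, `value_cols` and `out` are introduced here for illustration and are not part of the original answer.

```r
library(dplyr)

df_1 <- data.frame(id = "abc", date = "2018-01-01",
                   a = 3, b = NA_real_, c = 4, d = 6, e = NA_real_)
df_2 <- data.frame(id = "abc", date = "2018-01-01",
                   a = 3, b = 5, c = NA_real_, d = 6, e = NA_real_)

keys <- c("id", "date")
# Non-key columns present in both tables; these get the .x/.y suffixes
value_cols <- setdiff(intersect(names(df_1), names(df_2)), keys)

joined <- full_join(df_1, df_2, by = keys, suffix = c(".x", ".y"))

# Start from the key columns, then add one coalesced column per pair,
# preferring the value from df_1 (.x) and falling back to df_2 (.y)
out <- joined[keys]
for (col in value_cols) {
  out[[col]] <- coalesce(joined[[paste0(col, ".x")]],
                         joined[[paste0(col, ".y")]])
}
out
```

Unlike the full function, this sketch assumes every non-key column exists in both inputs, so it does not handle columns that appear in only one table.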

Answer 3

We can use my package safejoin, make a left join, and deal with the conflicts using dplyr::coalesce:

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)

safe_left_join(dt_1, dt_2, by = "id", conflict = coalesce)
#    id       date a b c d  e
# 1 abc 2018-01-01 3 5 4 6 NA

Merge R data frames or data tables and overwrite values in multiple columns
