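I often run into trouble when combining the non-NA values of duplicated columns after a join and then dropping the duplicates. It is similar to what is described in this question or this one. I wanted to write a small function around coalesce (possibly including left_join) so that I can handle it in one line whenever it comes up (the function itself can of course be as long as it needs to be).
In doing so, I ran into the lack of a quo_names equivalent of quos, as described here.
For a reprex, take a data frame of identifying information and join it with others that contain the correct values but often misspelled IDs.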
library(dplyr)
library(rlang)
iris_identifiers <- iris %>%
  select(contains("Petal"), Species)
iris_alt_name1 <- iris %>%
  mutate(Species = recode(Species, "setosa" = "stosa"))
iris_alt_name2 <- iris %>%
  mutate(Species = recode(Species, "versicolor" = "verscolor"))
This simpler function works:
replace_xy <- function(df, var) {
  x_var <- paste0(var, ".x")
  y_var <- paste0(var, ".y")
  df %>%
    mutate(!! quo_name(var) := coalesce(!! sym(x_var), !! sym(y_var))) %>%
    select(-(!! sym(x_var)), -(!! sym(y_var)))
}
iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  replace_xy("Sepal.Length") %>%
  replace_xy("Sepal.Width")
head(iris_full)
#> Petal.Length Petal.Width Species Sepal.Length Sepal.Width
#> 1 1.4 0.2 setosa 5.1 3.5
#> 2 1.4 0.2 setosa 4.9 3.0
#> 3 1.4 0.2 setosa 5.0 3.6
#> 4 1.4 0.2 setosa 4.4 2.9
#> 5 1.4 0.2 setosa 5.2 3.4
#> 6 1.4 0.2 setosa 5.5 4.2
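But I am somewhat lost on how to generalize this to several variables, which I thought would be the easier part. The snippet below is just a desperate attempt, after trying a number of variants, that roughly captures what I am trying to achieve.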
replace_many_xy <- function(df, vars) {
  x_var <- paste0(vars, ".x")
  y_var <- paste0(vars, ".y")
  df %>%
    mutate_at(vars(vars), funs(replace_xy(.data, .))) %>%
    select(-(!!! syms(x_var)), -(!!! syms(y_var)))
}
new_cols <- colnames(iris_alt_name1)
diff_cols <- new_cols[!(new_cols %in% colnames(iris_identifiers))]
iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  replace_many_xy(diff_cols)
#> Warning: Column `Species` joining factors with different levels, coercing
#> to character vector
#> Warning: Column `Species` joining character vector and factor, coercing
#> into character vector
#> Error: Unknown columns `Sepal.Length` and `Sepal.Width`
Any help would be much appreciated!
Answer 0 (score: 2)
I wrote a package that does just that, and it should be stable by now.
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  safe_left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width"), conflict = coalesce) %>%
  head
iris_full
# Petal.Length Petal.Width Species Sepal.Length Sepal.Width
# 1 1.4 0.2 setosa 5.1 3.5
# 2 1.4 0.2 setosa 4.9 3.0
# 3 1.4 0.2 setosa 5.0 3.6
# 4 1.4 0.2 setosa 4.4 2.9
# 5 1.4 0.2 setosa 5.2 3.4
# 6 1.4 0.2 setosa 5.5 4.2
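safe_left_join is an improved left_join that allows some checks through the check argument and some ways of dealing with column conflicts through the conflict argument, as we do here.
The conflict argument is a function that deals with the pairs of conflicting columns one at a time. To coalesce from the right instead, you would need conflict = ~coalesce(.y, .x).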
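To make the effect of the argument order concrete, here is a minimal sketch that uses only dplyr's coalesce on two made-up vectors. The names x and y are illustrative stand-ins for one pair of conflicting columns; they are an assumption for the example, not something safejoin produces.

```r
library(dplyr)

# Hypothetical pair of conflicting columns, as they might look after a join
x <- c(1, NA, 3, NA)    # values coming from the left table
y <- c(10, 20, NA, NA)  # values coming from the right table

# Left values win; gaps are filled from the right
left_first <- coalesce(x, y)
# Right values win; gaps are filled from the left
right_first <- coalesce(y, x)

print(left_first)
print(right_first)
```

Where both sides are NA the result stays NA, so the two orders differ only in rows where both columns hold a value.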
Here is one way to make your function work:
replace_many_xy <- function(tbl, vars){
  for(var in vars){
    x <- paste0(var,".x")
    y <- paste0(var,".y")
    tbl <- mutate(tbl, !!sym(var) := coalesce(!!sym(x) , !!sym(y) )) %>%
      select(-one_of(x,y))
  }
  tbl
}
iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  replace_many_xy(diff_cols) %>% as_tibble()
# # A tibble: 372 x 5
# Petal.Length Petal.Width Species Sepal.Length Sepal.Width
# <dbl> <dbl> <chr> <dbl> <dbl>
# 1 1.4 0.2 setosa 5.1 3.5
# 2 1.4 0.2 setosa 4.9 3
# 3 1.4 0.2 setosa 5 3.6
# 4 1.4 0.2 setosa 4.4 2.9
# 5 1.4 0.2 setosa 5.2 3.4
# 6 1.4 0.2 setosa 5.5 4.2
# 7 1.4 0.2 setosa 4.6 3.2
# 8 1.4 0.2 setosa 5 3.3
# 9 1.4 0.2 setosa 5.1 3.5
# 10 1.4 0.2 setosa 4.9 3
# # ... with 362 more rows
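If you would rather avoid tidy evaluation altogether, the same loop can be sketched with plain `[[` indexing. This is only a rough variant under stated assumptions: the helper name coalesce_xy and the toy data frame are made up for illustration, and it assumes every name in vars has both a .x and a .y column.

```r
library(dplyr)

# Hypothetical helper: same idea as replace_many_xy, but using `[[`
# instead of sym() and !!
coalesce_xy <- function(tbl, vars) {
  for (var in vars) {
    x <- paste0(var, ".x")
    y <- paste0(var, ".y")
    # Take the .x value where present, otherwise fall back to .y
    tbl[[var]] <- coalesce(tbl[[x]], tbl[[y]])
    # Drop the two suffixed columns
    tbl[[x]] <- NULL
    tbl[[y]] <- NULL
  }
  tbl
}

# Made-up data imitating the .x/.y columns left behind by a join
toy <- data.frame(
  id  = 1:3,
  a.x = c(1, NA, 3),
  a.y = c(NA, 2, 30),
  b.x = c("u", NA, NA),
  b.y = c("U", "v", NA),
  stringsAsFactors = FALSE
)

result <- coalesce_xy(toy, c("a", "b"))
print(result)
```

The coalesced columns end up at the end of the data frame, in the order given by vars.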