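Left join on multiple columns, one of which is a partial string

Time: 2018-04-19 19:36:31

Tags: r dplyr

I am trying to join on a) the first two characters of the first name, b) the last name, and c) the year. I have been reading about fuzzyjoin, but it does not look like what I need.

I have tried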

newly_joined_df <- names_df %>%
    left_join(values_df, by = c(substr("first_name", 1, 2), "last_name", "year")

and

newly_joined_df <- names_df %>%
    left_join(values_df, by = c(substr(names_df$first_name, 1, 2), "last_name", "year")

but both are silly solutions that contain obvious mistakes.

1 Answer:

Answer 0: (score: 1)

How about this?

library(dplyr)

# Add the same two-character key to both tables, join on it, then drop it
df1 %>%
  mutate(first_name_1st2char = substr(first_name, 1, 2)) %>%
  left_join(df2 %>% mutate(first_name_1st2char = substr(first_name, 1, 2)),
            by = c("first_name_1st2char", "last_name", "year")) %>%
  select(-first_name_1st2char)

The output is:

  first_name.x last_name year first_name.y age
1         john      asdf 2018          joe  12
2         jack    qwerty 2017         jake  34
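
As a side note on why the helper column is needed: `by` takes column names as character strings, so any expression placed inside `c()` is evaluated before the join ever sees it. The small base R sketch below illustrates this; the `names_df` built here is a hypothetical stand-in for the data frame in the question, not the asker's actual data.

```r
# substr() applied to the literal string "first_name" returns its first two
# characters, so the join would be asked for a column named "fi".
key <- substr("first_name", 1, 2)
print(key)

# Hypothetical stand-in for the question's names_df (illustration only)
names_df <- data.frame(first_name = c("john", "jack"),
                       stringsAsFactors = FALSE)

# Passing the column itself yields the values "jo" and "ja". These are data
# values, not column names, so they cannot be used in `by` either.
keys_from_column <- substr(names_df$first_name, 1, 2)
print(keys_from_column)
```

Computing the key as a real column on both sides first, as the answer does, gives the join a column name it can actually match on.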

Sample data:

df1 <- structure(list(first_name = structure(c(2L, 1L), .Label = c("jack", 
"john"), class = "factor"), last_name = structure(1:2, .Label = c("asdf", 
"qwerty"), class = "factor"), year = c(2018, 2017)), .Names = c("first_name", 
"last_name", "year"), row.names = c(NA, -2L), class = "data.frame")

df2 <- structure(list(first_name = structure(c(3L, 2L, 1L), .Label = c("donald", 
"jake", "joe"), class = "factor"), last_name = structure(c(1L, 
3L, 2L), .Label = c("asdf", "jong", "qwerty"), class = "factor"), 
    year = c(2018, 2017, 2018), age = c(12, 34, 5)), .Names = c("first_name", 
"last_name", "year", "age"), row.names = c(NA, -3L), class = "data.frame")
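
For anyone who wants to try the idea without dplyr, here is a minimal base R sketch of the same approach using `merge()` instead of `left_join()`. This is a swapped-in technique, not the answer's method. It rebuilds the sample data with plain character columns rather than factors, and the helper column name `fn2` is made up for illustration.

```r
# Same sample values as above, rebuilt with character columns
df1 <- data.frame(first_name = c("john", "jack"),
                  last_name  = c("asdf", "qwerty"),
                  year       = c(2018, 2017),
                  stringsAsFactors = FALSE)
df2 <- data.frame(first_name = c("joe", "jake", "donald"),
                  last_name  = c("asdf", "qwerty", "jong"),
                  year       = c(2018, 2017, 2018),
                  age        = c(12, 34, 5),
                  stringsAsFactors = FALSE)

# Helper key (hypothetical name): first two characters of first_name
df1$fn2 <- substr(df1$first_name, 1, 2)
df2$fn2 <- substr(df2$first_name, 1, 2)

# all.x = TRUE keeps every row of df1, which is the left-join behaviour
joined <- merge(df1, df2, by = c("fn2", "last_name", "year"), all.x = TRUE)

# Drop the helper key once the join is done
joined$fn2 <- NULL
print(joined)
```

Note that `merge()` sorts the result by the key columns by default and places the key columns first, so the row and column order may differ from the `left_join()` output shown in the answer.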