How do I add a column to one data frame based on a partial string match in another data frame?

Asked: 2019-05-07 22:07:30

Tags: r string dplyr

I am trying to use one dataset to clean another.

I have a data frame called MiscodedVisits that contains course names entered incorrectly (through human error):

# A tibble: 3 x 3
  EMAIL      SemesterYear Course
  <chr>      <chr>        <chr> 
1 aap@fn.edu S16          CHM212
2 aar@fn.edu S14          PHY000
3 abc@fn.edu F17          PHY000

I also have a data frame of course rosters called Rosters:

# A tibble: 5 x 3
  EMAIL      SemesterYear Course
  <chr>      <chr>        <chr> 
1 aap@fn.edu S17          CHM212
2 aap@fn.edu S16          CHM112
3 aar@fn.edu S14          PHY222
4 abc@fn.edu F17          AST300
5 abc@fn.edu F17          MAT255

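I want to look up each miscoded Course in Rosters (matching on EMAIL and SemesterYear) and add a CorrectedCourse column, chosen by a partial match on the part of the Course string that identifies the subject (CHM, PHY, etc.).

My desired result would make MiscodedVisits look like this: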

# A tibble: 3 x 4
  EMAIL      SemesterYear Course CorrectedCourse
  <chr>      <chr>        <chr>  <chr>          
1 aap@fn.edu S16          CHM212 CHM112         
2 aar@fn.edu S14          PHY000 PHY222         
3 abc@fn.edu F17          PHY000 NA 

I have tried:

A. Mutating a new column CorrectedCourse in MiscodedVisits based on a string match against Rosters$Course:

mutate(CorrectedCourse = DemoPerf$Course [match(EMAIL, DemoPerf$EMAIL) & match(SemesterYear, DemoPerf$SemesterYear)] )

This fails with: Error in match(EMAIL, DemoPerf$EMAIL) : object 'EMAIL' not found

B. fuzzy_inner_join (MiscodedVisits, Rosters, by= c(Course = "S\\d{2}"), match_fun = str_detect)

This fails with: Error: Column col must be a 1d atomic vector or a list

C. regex_inner_join (MiscodedVisits, Rosters, by= c(Course = "S\\d{2}"))

This fails with: Error: Column col must be a 1d atomic vector or a list

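1 Answer:

Answer 0: (score: 0)

You can use dplyr and stringr: extract the leading letters of Course as a subject code in both data frames, then left-join on EMAIL, SemesterYear, and that code.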

library(stringr)
library(dplyr)

MiscodedVisits %>% mutate(code = str_extract(Course, "[A-Z]*")) %>%
  left_join(Rosters %>% mutate(code = str_extract(Course, "[A-Z]*")), 
            by = c("EMAIL", "SemesterYear", "code")) %>% select(-code)

#       EMAIL SemesterYear Course.x Course.y
#1 aap@fn.edu          S16   CHM212   CHM112
#2 aar@fn.edu          S14   PHY000   PHY222
#3 abc@fn.edu          F17   PHY000     <NA>
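
The join above leaves the two course columns named Course.x and Course.y rather than Course and CorrectedCourse as asked in the question. A minimal sketch of one way to get the requested column name is to rename the roster's Course before joining. The sample data frames are rebuilt here with data.frame() so the snippet runs on its own, and the anchored pattern "^[A-Z]+" and the helper column name code are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(stringr)

# Sample data rebuilt from the question (assumed to be plain character columns)
MiscodedVisits <- data.frame(
  EMAIL        = c("aap@fn.edu", "aar@fn.edu", "abc@fn.edu"),
  SemesterYear = c("S16", "S14", "F17"),
  Course       = c("CHM212", "PHY000", "PHY000"),
  stringsAsFactors = FALSE
)

Rosters <- data.frame(
  EMAIL        = c("aap@fn.edu", "aap@fn.edu", "aar@fn.edu", "abc@fn.edu", "abc@fn.edu"),
  SemesterYear = c("S17", "S16", "S14", "F17", "F17"),
  Course       = c("CHM212", "CHM112", "PHY222", "AST300", "MAT255"),
  stringsAsFactors = FALSE
)

# Keep only the join keys from Rosters and carry its Course over as CorrectedCourse
roster_lookup <- Rosters %>%
  transmute(EMAIL,
            SemesterYear,
            code = str_extract(Course, "^[A-Z]+"),  # subject prefix, e.g. "CHM"
            CorrectedCourse = Course)

result <- MiscodedVisits %>%
  mutate(code = str_extract(Course, "^[A-Z]+")) %>%
  left_join(roster_lookup, by = c("EMAIL", "SemesterYear", "code")) %>%
  select(-code)

print(result)
```

If a student is enrolled in more than one course with the same subject prefix in the same semester, the left join returns one row per matching roster entry, so duplicates may need to be resolved afterwards.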