I am trying to use one dataset to clean another.
I have a data frame of course names that were entered incorrectly (through human error), called MiscodedVisits:
# A tibble: 3 x 3
EMAIL SemesterYear Course
<chr> <chr> <chr>
1 aap@fn.edu S16 CHM212
2 aar@fn.edu S14 PHY000
3 abc@fn.edu F17 PHY000
I also have a data frame of course rosters, called Rosters:
# A tibble: 5 x 3
EMAIL SemesterYear Course
<chr> <chr> <chr>
1 aap@fn.edu S17 CHM212
2 aap@fn.edu S16 CHM112
3 aar@fn.edu S14 PHY222
4 abc@fn.edu F17 AST300
5 abc@fn.edu F17 MAT255
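I want to look up each miscoded Course in Rosters (by EMAIL and SemesterYear) and add a CorrectedCourse column, based on a partial match on the part of the Course string that represents the subject (CHM, PHY, etc.).
My desired result would look like MiscodedVisits with one extra column: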
# A tibble: 3 x 4
EMAIL SemesterYear Course CorrectedCourse
<chr> <chr> <chr> <chr>
1 aap@fn.edu S16 CHM212 CHM112
2 aar@fn.edu S14 PHY000 PHY222
3 abc@fn.edu F17 PHY000 NA
I have tried:
A. Mutating a new column CorrectedCourse in MiscodedVisits based on a string match against Rosters$Course:
mutate(CorrectedCourse = DemoPerf$Course[match(EMAIL, DemoPerf$EMAIL) & match(SemesterYear, DemoPerf$SemesterYear)])
This fails with what looks like a syntax problem: Error in match(EMAIL, DemoPerf$EMAIL) : object 'EMAIL' not found
B. fuzzy_inner_join(MiscodedVisits, Rosters, by = c(Course = "S\\d{2}"), match_fun = str_detect)
Error: Column `col` must be a 1d atomic vector or a list
C. regex_inner_join(MiscodedVisits, Rosters, by = c(Course = "S\\d{2}"))
Error: Column `col` must be a 1d atomic vector or a list
Answer 0 (score: 0)
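You can use dplyr and stringr: extract the leading letters of Course (the subject code) in both data frames, then left-join on EMAIL, SemesterYear and that code.
library(stringr)
library(dplyr)
# Add a temporary `code` column holding the subject prefix (e.g. "CHM")
MiscodedVisits %>% mutate(code = str_extract(Course, "[A-Z]*")) %>%
  # Do the same for Rosters and join on student, semester and subject
  left_join(Rosters %>% mutate(code = str_extract(Course, "[A-Z]*")),
            by = c("EMAIL", "SemesterYear", "code")) %>%
  # Drop the helper column once the join is done
  select(-code)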
# EMAIL SemesterYear Course.x Course.y
#1 aap@fn.edu S16 CHM212 CHM112
#2 aar@fn.edu S14 PHY000 PHY222
#3 abc@fn.edu F17 PHY000 <NA>
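The join above leaves the two course columns named Course.x and Course.y. If you want the exact layout from the question (Course plus CorrectedCourse), one possible variation is to rename the roster column before joining. This is only a sketch under a few assumptions: the two tibbles are rebuilt here from the sample data in the question, roster_keys is a helper name made up for this example, and the pattern "^[A-Z]+" assumes every course code starts with uppercase letters.

```r
library(dplyr)
library(stringr)

# Sample data rebuilt from the question (tibble() is re-exported by dplyr)
MiscodedVisits <- tibble(
  EMAIL        = c("aap@fn.edu", "aar@fn.edu", "abc@fn.edu"),
  SemesterYear = c("S16", "S14", "F17"),
  Course       = c("CHM212", "PHY000", "PHY000")
)

Rosters <- tibble(
  EMAIL        = c("aap@fn.edu", "aap@fn.edu", "aar@fn.edu",
                   "abc@fn.edu", "abc@fn.edu"),
  SemesterYear = c("S17", "S16", "S14", "F17", "F17"),
  Course       = c("CHM212", "CHM112", "PHY222", "AST300", "MAT255")
)

# Hypothetical helper: roster rows reduced to the join keys, with Course
# renamed so it does not collide with MiscodedVisits$Course
roster_keys <- Rosters %>%
  mutate(code = str_extract(Course, "^[A-Z]+")) %>%
  select(EMAIL, SemesterYear, code, CorrectedCourse = Course)

# Join on student, semester and subject prefix, then drop the helper column
result <- MiscodedVisits %>%
  mutate(code = str_extract(Course, "^[A-Z]+")) %>%
  left_join(roster_keys, by = c("EMAIL", "SemesterYear", "code")) %>%
  select(-code)

print(result)
```

Note that left_join returns one row per match, so a student enrolled in two courses with the same subject prefix in the same semester would produce two rows for one miscoded visit. The sample data has no such case, so you would need to decide how to handle it for the real rosters.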