Approximate text matching and updating at the same time

Time: 2018-03-17 10:01:28

Tags: r multiple-columns textmatching

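I have a data frame df1 with a column named university_name and 500,000 rows. I have another data frame, df2, with two columns, university_alias and university_name_new, and 150 rows. I want to match each alias in the university_alias column against the names in df1$university_name and replace every match with the corresponding university_name_new.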

Sample of df1$university_name

university of auckland
the university of auckland
university of warwick - warwick business school
unv of warwick
seneca college of applied arts and technology
seneca college
univ of auckland

Sample of df2

University_Alias                  University_Name_new

univ of auckland                  university of auckland
universiry of auckland            university of auckland
auckland university               university of auckland
university of auckland            university of auckland
warwick university                university of warwick
warwick univercity                university of warwick
university of warwick             university of warwick
seneca college                    seneca college
unv of warwick                    university of warwick

I expect output like this

university of auckland
university of auckland
university of warwick
seneca college
seneca college

I am using the following code, but it does not work

 df$university_name[ grepl(df$university_name,df2$university_alias)] <- df2$university_name_new

2 Answers:

Answer 0: (score: 0)

You can do it like this

df2$University_Name_new[which(is.element(df2$University_Alias, df1$university_name))]
### which returns the following ####
[1] "university of auckland" "seneca college" 

Note that in the data you provided, "the university of auckland", for example, appears in df1$university_name but not in df2$University_Alias, which is why we get the following:

> which(is.element(df2$University_Alias, df1$university_name))
[1] 4 8

Indeed, of the values in df2$University_Alias, df1$university_name contains only "university of auckland" and "seneca college".
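To go from this membership check to the update asked for in the question, one option is an exact lookup with match(). The following is a minimal sketch, assuming the column names from the samples (University_Alias, University_Name_new); the column university_name_clean is a name made up for this illustration. Rows without an exact alias are left unchanged.

```r
# Sample data taken from the question (shortened)
df1 <- data.frame(university_name = c("university of auckland",
                                      "the university of auckland",
                                      "unv of warwick",
                                      "seneca college",
                                      "univ of auckland"),
                  stringsAsFactors = FALSE)

df2 <- data.frame(University_Alias = c("univ of auckland",
                                       "university of auckland",
                                       "seneca college",
                                       "unv of warwick"),
                  University_Name_new = c("university of auckland",
                                          "university of auckland",
                                          "seneca college",
                                          "university of warwick"),
                  stringsAsFactors = FALSE)

# Position of each df1 name in the alias column (NA when there is no exact match)
idx <- match(df1$university_name, df2$University_Alias)

# Use the canonical name where an alias matched, otherwise keep the original
# (university_name_clean is a hypothetical column name)
df1$university_name_clean <- ifelse(is.na(idx),
                                    df1$university_name,
                                    df2$University_Name_new[idx])

print(df1)
```

Because match() is vectorised, this avoids looping over the 500,000 rows of df1. It only catches names that equal an alias exactly, so "the university of auckland" stays as it is.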

Answer 1: (score: 0)

You can use sapply and str_extract to get the desired result.

library(stringr)  # provides str_extract()

# create sample data
df1 <- data.frame(university_name = c('university of auckland',
                                      'the university of auckland',
                                      'university of warwick - warwick business school',
                                      'seneca college of applied arts and technology',
                                      'seneca college'), stringsAsFactors = FALSE)

# these are values to match (from df2)
vals <- c('university of auckland','university of warwick','seneca college')

# get the output: for each name, keep the value from vals that occurs inside it
df1$output <- sapply(df1$university_name, function(z) {

    # str_extract() returns NA for every value that is not found in z
    f <- vals[!is.na(str_extract(string = z, pattern = vals))]
    return(f)

}, USE.NAMES = FALSE)

print(df1)

                                  university_name                 output
1                          university of auckland university of auckland
2                      the university of auckland university of auckland
3 university of warwick - warwick business school  university of warwick
4   seneca college of applied arts and technology         seneca college
5                                  seneca college         seneca college
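One caveat with the sapply() approach: it returns a plain character vector only when every name contains exactly one of the values. With zero or several hits for some row, the result becomes a list. A hedged variant that uses base R only is sketched below. The helper name pick_canonical is hypothetical, and preferring the longest value when several match is an assumption made for this illustration.

```r
df1 <- data.frame(university_name = c("university of auckland",
                                      "the university of auckland",
                                      "university of warwick - warwick business school",
                                      "seneca college of applied arts and technology",
                                      "some unknown school"),
                  stringsAsFactors = FALSE)

vals <- c("university of auckland", "university of warwick", "seneca college")

# Hypothetical helper: return the canonical name contained in z, or NA
pick_canonical <- function(z, candidates) {
  # Literal substring test against every candidate (no regex interpretation)
  hit <- vapply(candidates, function(p) grepl(p, z, fixed = TRUE), logical(1))
  found <- candidates[hit]
  if (length(found) == 0) return(NA_character_)
  # Assumption: prefer the longest candidate when several match
  found[which.max(nchar(found))]
}

# vapply() guarantees one character value per row
df1$output <- vapply(df1$university_name, pick_canonical, character(1),
                     candidates = vals, USE.NAMES = FALSE)

print(df1)
```

Rows that contain none of the values get NA instead of silently changing the shape of the result, which makes unmatched names easy to inspect afterwards.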

Update

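As I understand it, df2 already maps each university_alias to a single university_name_new, so the problem comes down to checking whether each university_alias exists in df1; if it does not, we remove that row.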

# check values for university_alias in university_name
maps2 <- as.character(df2$university_alias[which(df2$university_alias %in% df1$university_name)])

# remove unmatched rows from df2
df3 <- df2[df2$university_alias %in% maps2,]

print(df3)
            university_alias    university_name_new
1           univ of auckland university of auckland
4     university of auckland university of auckland
8             seneca college         seneca college
9             unv of warwick  university of warwick
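The checks above are exact, so a misspelling that is missing from the alias table is not caught. As a rough sketch of the approximate part of the question, base R's adist() computes edit distances between two character vectors, so each name can be mapped to its nearest alias. The sample values, the tolerance max_dist = 2, and the reuse of university_name_new as a column in df1 are assumptions for this illustration, not values from the question.

```r
df1 <- data.frame(university_name = c("universty of auckland",   # misspelt
                                      "seneca college",
                                      "warwick universitty",     # misspelt
                                      "completely different"),
                  stringsAsFactors = FALSE)

df2 <- data.frame(university_alias = c("university of auckland",
                                       "warwick university",
                                       "seneca college"),
                  university_name_new = c("university of auckland",
                                          "university of warwick",
                                          "seneca college"),
                  stringsAsFactors = FALSE)

# Rows: df1 names, columns: aliases; entries are edit (Levenshtein) distances
d <- adist(df1$university_name, df2$university_alias)

nearest  <- apply(d, 1, which.min)  # index of the closest alias for each name
distance <- apply(d, 1, min)        # edit distance to that alias

max_dist <- 2  # assumed tolerance; tune it on real data

# Accept the nearest alias only when it is close enough, otherwise NA
df1$university_name_new <- ifelse(distance <= max_dist,
                                  df2$university_name_new[nearest],
                                  NA_character_)

print(df1)
```

With 500,000 rows and 150 aliases the full distance matrix has 75 million cells, so it is probably worth running this on unique(df1$university_name) or in chunks and joining the result back.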