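I have a data frame df1 with a column named university_name and 500,000 rows. I have another data frame, df2, with two columns, University_Alias and Univeristy_Name_new, and 150 rows. I want to match each university name in df1 against the aliases in University_Alias and replace it with the corresponding value from Univeristy_Name_new.
Sample of df1$university_name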
university of auckland
the university of auckland
university of warwick - warwick business school
unv of warwick
seneca college of applied arts and technology
seneca college
univ of auckland
Sample of df2
University_Alias        Univeristy_Name_new
univ of auckland        university of auckland
universiry of auckland  university of auckland
auckland university     university of auckland
university of auckland  university of auckland
warwick university      university of warwick
warwick univercity      university of warwick
university of warwick   university of warwick
seneca college          seneca college
unv of warwick          university of warwick
I expect output like this
university of auckland
university of auckland
university of warwick
seneca college
seneca college
I am using the following code, but it does not work
df$university_name[ grepl(df$university_name,df2$university_alias)] <- df2$university_name_new
Answer 0 (score: 0)
You can do this
df2$University_Name_new[which(is.element(df2$University_Alias, df1$university_name))]
### which returns the following ####
[1] "university of auckland" "seneca college"
Note that, for example, the value "the university of auckland" from your data is in df1$university_name but not in df2$University_Alias, which is why we get the following:
> which(is.element(df2$University_Alias, df1$university_name))
[1] 4 8
In fact, of the values in df1$university_name, only "university of auckland" and "seneca college" also appear in df2$University_Alias.
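The membership test above only reports which aliases occur in df1; it does not write the canonical names back. A minimal sketch of one way to do that is an exact lookup with base R's match(). The data below is a subset of the sample from the question, and the output column name university_name_new on df1 is an assumption made for illustration. Names that are not listed as an alias are left as NA.

```r
# Subset of the sample data from the question (column spelling as posted).
df1 <- data.frame(
  university_name = c("university of auckland", "the university of auckland",
                      "unv of warwick", "seneca college", "univ of auckland"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  University_Alias = c("univ of auckland", "university of auckland",
                       "seneca college", "unv of warwick"),
  Univeristy_Name_new = c("university of auckland", "university of auckland",
                          "seneca college", "university of warwick"),
  stringsAsFactors = FALSE
)

# match() returns, for each df1 name, the row of df2 holding the same alias,
# or NA when the name is not a known alias.
idx <- match(df1$university_name, df2$University_Alias)

# Indexing with NA yields NA, so unmatched names stay NA in the new column.
# (The column name university_name_new is an assumption for this sketch.)
df1$university_name_new <- df2$Univeristy_Name_new[idx]
print(df1)
```

Because match() is vectorised, this should remain practical for the 500,000 rows mentioned in the question, though that is not something measured here.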
Answer 1 (score: 0)
You can use sapply and str_extract to get the desired result.
# str_extract() comes from the stringr package
library(stringr)
# create sample data
df1 <- data.frame(university_name = c('university of auckland',
                                      'the university of auckland',
                                      'university of warwick - warwick business school',
                                      'seneca college of applied arts and technology',
                                      'seneca college'), stringsAsFactors = FALSE)
# these are values to match (from df2)
vals <- c('university of auckland','university of warwick','seneca college')
# get the output: for each name, keep the entries of vals found inside it
df1$output <- sapply(df1$university_name, function(z) {
  vals[!is.na(str_extract(string = z, pattern = vals))]
}, USE.NAMES = FALSE)
print(df1)
   university_name                                  output
1  university of auckland                           university of auckland
2  the university of auckland                       university of auckland
3  university of warwick - warwick business school  university of warwick
4  seneca college of applied arts and technology    seneca college
5  seneca college                                   seneca college
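One caveat with the sapply call: if a name contains none of the values, or more than one, the result becomes a list rather than a character vector. A possible type-stable variant in base R is sketched below. The helper name longest_match and the choice to prefer the longest contained value are assumptions for illustration, not part of the answer above.

```r
vals <- c("university of auckland", "university of warwick", "seneca college")

# Hypothetical helper: return the longest entry of `vals` that occurs as a
# literal substring of `name`, or NA when none does.
longest_match <- function(name, vals) {
  found <- vapply(vals, function(v) grepl(v, name, fixed = TRUE), logical(1))
  hits <- vals[found]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]
}

names_in <- c("the university of auckland",
              "seneca college of applied arts and technology",
              "harvard university")  # made-up name with no match

# vapply() guarantees exactly one string per input name.
out <- vapply(names_in, longest_match, character(1), vals = vals,
              USE.NAMES = FALSE)
print(out)
```

Using fixed = TRUE treats each value as a literal string, so characters such as "." or "(" in a university name are not interpreted as regular expression syntax.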
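Update
As I understand it, df2 already has a one-to-one mapping from university_alias to university_name_new, so the problem comes down to checking whether each university_alias exists in df1 and dropping the rows of df2 whose alias does not.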
# check values for university_alias in university_name
maps2 <- as.character(df2$university_alias[which(df2$university_alias %in% df1$university_name)])
# remove unmatched rows from df2
df3 <- df2[df2$university_alias %in% maps2,]
print(df3)
   university_alias        university_name_new
1  univ of auckland        university of auckland
4  university of auckland  university of auckland
8  seneca college          seneca college
9  unv of warwick          university of warwick
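The filtered mapping still has to be applied to df1, and names such as "the university of auckland" are not aliases at all. One way to cover both cases is a two-stage pass: an exact alias lookup first, then a substring search for the canonical names on whatever is left. This is a sketch under assumptions, not the method of either answer. It uses the sample data from the question, and it assumes the canonical names contain no regular expression metacharacters, since they are joined into one pattern with "|".

```r
df1 <- data.frame(
  university_name = c("university of auckland",
                      "the university of auckland",
                      "university of warwick - warwick business school",
                      "unv of warwick",
                      "seneca college of applied arts and technology",
                      "seneca college",
                      "univ of auckland"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  university_alias = c("univ of auckland", "universiry of auckland",
                       "auckland university", "university of auckland",
                       "warwick university", "warwick univercity",
                       "university of warwick", "seneca college",
                       "unv of warwick"),
  university_name_new = c(rep("university of auckland", 4),
                          rep("university of warwick", 3),
                          "seneca college", "university of warwick"),
  stringsAsFactors = FALSE
)

# Stage 1: exact alias lookup; names that are not aliases become NA.
out <- df2$university_name_new[match(df1$university_name, df2$university_alias)]

# Stage 2: for the remaining names, look for a canonical name inside the text.
# Assumes the canonical names are plain text with no regex metacharacters.
pattern <- paste(unique(df2$university_name_new), collapse = "|")
todo <- which(is.na(out))
m <- regexpr(pattern, df1$university_name[todo])
# regmatches() returns only the matched substrings, one per successful match.
out[todo[m > 0]] <- regmatches(df1$university_name[todo], m)

df1$university_name_new <- out
print(df1)
```

Both stages are vectorised, so no per-row R function is called. Names that match neither stage stay NA and can be inspected separately.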