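I want to fuzzy-match two data frames (s1 is the data, s2 is the reference) on the "Answer" column, so that I can get the corresponding question count and category from s2. For example: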
s1 <- data.frame(Category =c("Stationary","TransferRelocationClaim","IMS"),
Question =c( "Where do I get stationary items from?","Process for claiming Transfer relocation allowances.","What is IMS?"),Answer = c("Hey <firstname>, you will find it near helpdesk ","Hey <firstname>, moving to new places can be fun! To claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon).","ims or interview management system is a tool that helps interviewers schedule all the interviews"),
stringsAsFactors = FALSE)
s2 <- data.frame(
Question = c("Where to get books?", "Procedure to order stationary?","I would like to know about my relocation and relocation expenses","tell me about relocation expense claiming","how to claim relocation expense","IMS?"),
Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk.", "hey sonali moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon)","hey piyush moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,assignments ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon). 3. attach the bills to the printout and secure approval sign-off / mail (from the pa support for new joinee relocation claims and the portal approver for existing employees). 4. drop the bills in the portal drop box (the duty manager amp, finance team can confirm the coordinates.", "hey vibha moving to new places can be fun! to claim relocation expense please follow the steps given below- 1. request you to add the code in https://portal.mycompany.com ,assignments ,enter relocation code ,add. 2. select expenses ,add expense ,other expense ,fill the form ,save ,print (select the print icon). 3. attach the bills to the printout and secure approval sign-off / mail from the pa support for new joinee relocation claims and the portal approver for existing employees). 4. drop the bills in the portal drop box (the duty manager amp, finance team can confirm the coordinates", "ims or interview management system is a tool that helps interviewers schedule all the interviews")
, stringsAsFactors = FALSE) # leading comma added: the Answer vector above was not followed by a separator
library(stringdist) # provides stringdistmatrix()
# the text column is called Answer in both data frames (there is no Response column)
s1$Answer=gsub('[[:punct:] ]+',' ',s1$Answer)
s2$Answer=gsub('[[:punct:] ]+',' ',s2$Answer)
s1$Answer <- tolower(s1$Answer)
s2$Answer <- tolower(s2$Answer)
# data =s1, lookup=s2
d.matrix <- stringdistmatrix(a = s2$Answer, b = s1$Answer, useNames="strings",method="cosine", nthread = getOption("sd_num_thread"))
#minimum cosine distance for each s1 answer (one column per s1 row)
cosines<-apply(d.matrix, 2, min)
#row number in s2 of that minimum value
minlist<-apply(d.matrix, 2, which.min)
#best matching s2 answers
matchwith<-s2$Answer[minlist]
#table of each s1 answer, its best match and the cosine distance
answer<-data.frame(s1_answer=s1$Answer, matchwith, cosines)
t11=merge(x=answer,y=s2, by.x="matchwith", by.y="Answer", all.x=TRUE)
View(t11)
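Answer 0 (score: 0)
You can attempt the match with the agrepl function, which lets you set a maximum "distance", i.e. the "sum of transformations needed to get from the pattern to the target." I would use sub to take out the material inside the angle brackets: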
agrepl(sub("<.+>, ", "", df1$Answer), df2$Answer, 8)
[1] TRUE TRUE FALSE
(Note: the FALSE appears because I modified the second data frame so that it has a non-matching "Answer" value.)
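agrepl expects a single pattern string, so with more than one template row the call has to be repeated per template. A minimal sketch in base R, assuming hypothetical `templates` and `answers` vectors (made up for illustration, not the asker's data) and reusing the maximum distance of 8 from the call above:

```r
# Hypothetical templates and answers, for illustration only
templates <- c("Hey <firstname>, you will find it at the helpdesk.",
               "Hey <firstname>, your claim has been approved.")
answers <- c("Hey Anil, you will find it at the helpdesk.",
             "Hey Sonali, your claim has been approved.",
             "ims or interview management system is a tool")

# Strip the placeholder and the comma after it, as in the call above
patterns <- sub("<.+>, ", "", templates)

# One agrepl() call per pattern; rows are answers, columns are templates
hits <- sapply(patterns, function(p) agrepl(p, answers, max.distance = 8))
colnames(hits) <- c("helpdesk", "claim")

# Number of answers matched by each template
colSums(hits)
```

The column sums give a per-template match count, which is one way to get at the "question count" the question asks for. The threshold of 8 is carried over unchanged and would probably need tuning on real data.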
Answer 1 (score: 0)
If we modify your first input slightly, we can use the packages fuzzyjoin / dplyr / stringr as follows:
df1 <- data.frame(
Category = "Stationary",
Question = "Where do I get stationary items from?",
Answer = "Hey <firstname>, you will find it <here>.", # <-notice the change!
stringsAsFactors = FALSE
)
df2 <- data.frame(
Category = c("Stat1", "Stat1"),
Question = c("Where to get books?", "Procedure to order stationary?"),
Answer = c("Hey Anil, you will find it at the helpdesk.", "Hey, Shekhar, you will find it at the helpdesk."),
stringsAsFactors = FALSE
)
We build a regex pattern from Answer:
df1 <- dplyr::mutate(
df1,
Answer_regex =gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", Answer), # escape special
Answer_regex = gsub(" *?<.*?> *?",".*?", Answer_regex), # replace place holders by .*?
Answer_regex = paste0("^",Answer_regex,"$")) # make sure the match is exact
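To see why the escaping step comes first, here is a small base-R sketch. The helper name `template_to_regex` and the template text are hypothetical, and `perl = TRUE` is my own choice to make the lazy quantifiers behave predictably; the gsub patterns themselves are the ones used above.

```r
# Hypothetical helper wrapping the two gsub() steps used above
template_to_regex <- function(template) {
  # 1. escape regex metacharacters so literal text stays literal
  escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", template, perl = TRUE)
  # 2. turn each <placeholder>, plus adjacent spaces, into a lazy wildcard
  wild <- gsub(" *?<.*?> *?", ".*?", escaped, perl = TRUE)
  # 3. anchor so that the whole string has to match
  paste0("^", wild, "$")
}

template <- "See section 2.1 (FAQ) for <topic>."   # made-up template
target   <- "See section 2.1 (FAQ) for relocation."

pattern <- template_to_regex(template)
grepl(pattern, target, perl = TRUE)

# Skipping step 1 leaves "(FAQ)" as a capture group, so the literal
# parentheses in the target no longer match
unescaped <- paste0("^", gsub(" *?<.*?> *?", ".*?", template, perl = TRUE), "$")
grepl(unescaped, target, perl = TRUE)
```

The order matters: escaping after the placeholder substitution would also escape the `.*?` that was just inserted.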
We use stringr::str_detect together with fuzzyjoin::fuzzy_left_join to find the matches:
res <- fuzzyjoin::fuzzy_left_join(df2, df1, by= c(Answer="Answer_regex"), match_fun = stringr::str_detect )
res
# Category.x Question.x Answer.x Category.y
# 1 Stat1 Where to get books? Hey Anil, you will find it at the helpdesk. Stationary
# 2 Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk. Stationary
# Question.y Answer.y Answer_regex
# 1 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$
# 2 Where do I get stationary items from? Hey <firstname>, you will find it <here>. ^Hey.*?, you will find it.*?\\.$
Then we can count:
dplyr::count(res,Answer.y)
# # A tibble: 1 x 2
# Answer.y n
# <chr> <int>
# 1 Hey <firstname>, you will find it <here>. 2
Note that I treated the spaces outside < and > as part of the placeholder. Had I not done so, "Hey, Shekhar" would not have matched, because of the comma.
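The effect of absorbing the spaces can be checked directly in base R. This is only a sketch: it uses grepl with `perl = TRUE` in place of stringr::str_detect, and starts from the template with its final dot already escaped.

```r
answers <- c("Hey Anil, you will find it at the helpdesk.",
             "Hey, Shekhar, you will find it at the helpdesk.")
escaped <- "Hey <firstname>, you will find it <here>\\."  # template after escaping

# Placeholder only: the space after "Hey" stays mandatory
strict <- paste0("^", gsub("<.*?>", ".*?", escaped, perl = TRUE), "$")
# Placeholder plus adjacent spaces, as in the pattern built above
loose <- paste0("^", gsub(" *?<.*?> *?", ".*?", escaped, perl = TRUE), "$")

grepl(strict, answers, perl = TRUE)  # TRUE FALSE
grepl(loose, answers, perl = TRUE)   # TRUE TRUE
```

With the strict pattern, "Hey" must be followed by a space, which "Hey, Shekhar" does not provide.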
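Edit to address a comment:
# (the output below was produced with df1$Answer set to "Hey <firstname>, you will find it here.")
df1 <- dplyr::mutate(df1, Answer_trimmed = gsub("<.*?>", "", Answer))
res <- fuzzyjoin::fuzzy_left_join(df2, df1, by= c(Answer="Answer_trimmed"),
match_fun = function(x,y) stringdist::stringdist(x, y) / nchar(y) < 0.7)
res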
# Category.x Question.x Answer.x Category.y
# 1 Stat1 Where to get books? Hey Anil, you will find it at the helpdesk. Stationary
# 2 Stat1 Procedure to order stationary? Hey, Shekhar, you will find it at the helpdesk. <NA>
# Question.y Answer.y Answer_trimmed
# 1 Where do I get stationary items from? Hey <firstname>, you will find it here. Hey , you will find it here.
# 2 <NA> <NA> <NA>
dplyr::count(res,Answer.y)
# # A tibble: 2 x 2
# Answer.y n
# <chr> <int>
# 1 <NA> 1
# 2 Hey <firstname>, you will find it here. 1
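To see why only the first row clears the 0.7 threshold, the length-normalised distance can be inspected on its own. The sketch below is an approximation under one assumption: it uses base R's adist (generalized Levenshtein distance) as a stand-in for stringdist::stringdist, whose default method is a different edit distance, so the two can disagree on other inputs.

```r
answers <- c("Hey Anil, you will find it at the helpdesk.",
             "Hey, Shekhar, you will find it at the helpdesk.")
trimmed <- gsub("<.*?>", "", "Hey <firstname>, you will find it here.", perl = TRUE)

# adist() returns a matrix of edit distances (one row per answer)
ratio <- as.vector(adist(answers, trimmed)) / nchar(trimmed)

# Same rule as the match_fun above: keep pairs whose ratio is below 0.7
data.frame(answer = answers, ratio = round(ratio, 2), matched = ratio < 0.7)
```

Because the distance is divided by the length of the short trimmed template, the extra ", Shekhar" text is enough to push the second answer over the cutoff. Dividing by the longer string's length instead, or raising the cutoff, would be a more forgiving choice.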