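I have three vectors: one containing the text, i.e. the actual words/sentences (`text`); one containing the words I want to search for (`xreg`); and a third (`categories`) containing the category each text should be assigned to if a match is found. Here are the three vectors: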
text <- c("Sole Service here", "Freedom to Include","Freedom to Incl","Premier Reg",
"Bankhall","Bankhall","Premier Regiona","St James Play",
"Premier Regional","Health online","Premier Regional",
"Tenet","Health on line","Tenet","Nations","Woolwich",
"Premier Regional","Lifesearch","Nations","Bankhall",
"Premier Regional","Sole Service her","Lifesearch",
"Premier Regional","Sole Service","Nations",
"Sole Service","First Money service","Sole Service",
"Nations wide","Sole Service","Premier Region")
text <- tolower(text)
xreg <- c("sole","freedom","premier","bankhall","james","health","tennet",
"nations","woolwich","life","money")
categories <- c("SS", "FD", "PR", "BK", "JM", "HT", "TT", "NT", "WW", "LF", "MY")
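I want to search the `text` vector for the search terms in the `xreg` vector. When a match is found, I want to assign that text to the corresponding category in the `categories` vector.
For example, look for the word "sole" and annotate the indices where it matches, or simply create a data frame with the text and a separate column stating which category it belongs to. If the text contains "sole", it goes in category "SS"; "freedom" goes in category "FD", and so on.
My solution so far: I can search for each keyword one at a time, which tells me the indices where a match is found.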
reg_func <- function(x){grep(x,text)
}
reg_func("sole")
reg_func("freedom")
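This gives me the indices for each matched word, which I can then use to update the categories. Is there a faster way to do this, rather than searching for one word at a time? Thanks.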
Answer 0 (score: 1)
You can do it like this:
Data: (modified so that the first entry has a double match and the last entry has no match)
text <- c("Sole Service here, premier", "Freedom to Include","Freedom to Incl","Premier Reg",
"Bankhall","Bankhall","Premier Regiona","St James Play",
"Premier Regional","Health online","Premier Regional",
"Tenet","Health on line","Tenet","Nations","Woolwich",
"Premier Regional","Lifesearch","Nations","Bankhall",
"Premier Regional","Sole Service her","Lifesearch",
"Premier Regional","Sole Service","Nations",
"Sole Service","First Money service","Sole Service",
"Nations wide","Sole Service","Premier Region", "no match in here!!!")
#text <- tolower(text) # not needed, use ignore.case = T later
xreg <- c("sole","freedom","premier","bankhall","james","health","tennet",
"nations","woolwich","life","money")
categories <- c("SS", "FD", "PR", "BK", "JM", "HT", "TT", "NT", "WW", "LF", "MY")
Code:
names(categories) = xreg # create named vector
ans <- data.frame(text = I(text)) # create a data.frame where you store it all.
ans$xreg_m <- apply(
  sapply(xreg, function(x) grepl(x, text, ignore.case = TRUE)),
  1, function(x) xreg[x]
)
# If no match is found, character(0) is returned. It has length 0,
# so lengths() finds those entries and they are replaced with NA.
ans$xreg_m[!lengths(ans$xreg_m)] <- NA
ans$categories_m <- sapply(ans$xreg_m, function(x) unique(unname(categories[x])))
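One thing to watch in the code above: `apply()` simplifies its result to a vector or matrix whenever every row yields the same number of matches, so `xreg_m` is only a list when the match counts differ between rows. The following is a minimal sketch, not part of the original answer, that builds the same two columns with `lapply()` so the result is always a list. The shortened `text`, `xreg` and `categories` vectors are illustrative assumptions.

```r
# Illustrative, shortened inputs (assumed for this sketch)
text <- c("Sole Service here, premier", "Freedom to Include", "no match in here!!!")
xreg <- c("sole", "freedom", "premier")
categories <- c(sole = "SS", freedom = "FD", premier = "PR")  # named by pattern

# For each text, test every pattern and keep the ones that match.
# lapply() never simplifies, so the result is always a list.
xreg_m <- lapply(text, function(s) {
  hit <- vapply(xreg, function(p) grepl(p, s, ignore.case = TRUE), logical(1))
  found <- xreg[hit]
  if (length(found) == 0) NA_character_ else found
})

# Look up the category of every matched pattern by name
categories_m <- lapply(xreg_m, function(m) unique(unname(categories[m])))

ans <- data.frame(text = I(text))
ans$xreg_m <- xreg_m
ans$categories_m <- categories_m
ans
```

This trades the single vectorised `grepl()` call per pattern for one call per text/pattern pair, so it may be slower on large inputs; it is meant to show the shape of a simplification-proof version rather than the fastest one.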
Result:
#   text                        xreg_m         categories_m
#1  Sole Service here, premier  sole, premier  SS, PR
#2  Freedom to Include          freedom        FD
#3  Freedom to Incl             freedom        FD
#4  Premier Reg                 premier        PR
#5  Bankhall                    bankhall       BK
#6  Bankhall                    bankhall       BK
#7  Premier Regiona             premier        PR
#8  St James Play               james          JM
#9  Premier Regional            premier        PR
#10 Health online               health         HT
#11 Premier Regional            premier        PR
#12 Tenet                       NA             NA
#13 Health on line              health         HT
#14 Tenet                       NA             NA
#15 Nations                     nations        NT
#16 Woolwich                    woolwich       WW
#17 Premier Regional            premier        PR
#18 Lifesearch                  life           LF
#19 Nations                     nations        NT
#20 Bankhall                    bankhall       BK
#21 Premier Regional            premier        PR
#22 Sole Service her            sole           SS
#23 Lifesearch                  life           LF
#24 Premier Regional            premier        PR
#25 Sole Service                sole           SS
#26 Nations                     nations        NT
#27 Sole Service                sole           SS
#28 First Money service         money          MY
#29 Sole Service                sole           SS
#30 Nations wide                nations        NT
#31 Sole Service                sole           SS
#32 Premier Region              premier        PR
#33 no match in here!!!         NA             NA
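Because `xreg_m` and `categories_m` are list columns, they can be awkward to filter or export. As a hedged sketch (the `collapse_matches` helper is hypothetical and not part of the answer), each list element can be collapsed into a single comma-separated string while keeping `NA` for rows with no match:

```r
# Hypothetical helper: collapse each list element into one string,
# leaving entries without a match as NA
collapse_matches <- function(lst) {
  vapply(lst, function(m) {
    if (all(is.na(m))) NA_character_ else paste(m, collapse = ", ")
  }, character(1))
}

# Small illustrative list shaped like ans$categories_m
categories_m <- list(c("SS", "PR"), "FD", NA_character_)
flat <- collapse_matches(categories_m)
flat
```

The result is an ordinary character vector, so it can be stored as a regular data frame column.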
Answer 1 (score: 0)
Explaining the functions used in @Andre Elrico's answer:
apply(
sapply(xreg, function(x) {grepl(x, text, ignore.case = T)}), 1, function(x) xreg[x]
)
# Apply each xreg pattern to the text vector and see if there's a match.
# The result is a TRUE/FALSE matrix (one row per text, one column per pattern)
# marking where there is a match.
sapply(xreg, function(x) {grepl(x, text, ignore.case = T)})
      sole freedom premier bankhall james health tennet nations woolwich  life money
[1,]  TRUE   FALSE    TRUE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[2,] FALSE    TRUE   FALSE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[3,] FALSE    TRUE   FALSE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[4,] FALSE   FALSE    TRUE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[5,] FALSE   FALSE   FALSE     TRUE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[6,] FALSE   FALSE   FALSE     TRUE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
# Now, row by row, use the TRUE values from the previous result to subset xreg
# and see which elements of xreg each text matches
apply(
sapply(xreg, function(x) {grepl(x, text, ignore.case = T)}), 1, function(x) xreg[x]
)
Result:
[[1]]
[1] "sole" "premier"
[[2]]
[1] "freedom"
[[3]]
[1] "freedom"
[[4]]
[1] "premier"
[[5]]
[1] "bankhall"
[[6]]
[1] "bankhall"
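The `xreg[x]` step relies on logical subsetting: indexing a vector with TRUE/FALSE values keeps only the elements at the TRUE positions. A small illustration, using hand-written rows and a shortened `xreg` (both assumed for this sketch):

```r
xreg <- c("sole", "freedom", "premier")

# A row like the first one in the matrix: matches "sole" and "premier"
row1 <- c(TRUE, FALSE, TRUE)
matched <- xreg[row1]
matched

# A row with no match at all yields a zero-length character vector
row_none <- c(FALSE, FALSE, FALSE)
unmatched <- xreg[row_none]
length(unmatched)
```

The zero-length case is why the first answer uses `lengths()` to find rows without a match and replaces them with `NA`.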
Now get the category for each matched term (regular expression):
sapply(ans$xreg_m, function(x) unique(unname( categories[x] )))
which says:
# Take each element of xreg_m (our matched terms) and
# see which element in the categories vector it matches with
# Then unname the result so you only get the category
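To make the lookup concrete, here is a small sketch of indexing a named vector by name. The `"service"` pattern is a hypothetical addition, included only to show when `unique()` changes the result (two matched patterns mapping to the same category):

```r
# Named vector: names are the patterns, values are the categories.
# "service" is hypothetical and shares the "SS" category with "sole".
categories <- c(sole = "SS", service = "SS", premier = "PR")

# Indexing by name returns the categories, still carrying the names
looked_up <- categories[c("sole", "premier")]
plain <- unname(looked_up)
plain

# Two matched patterns with the same category collapse to one entry
dedup <- unique(unname(categories[c("sole", "service")]))
dedup
```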