Regex to match and assign to categories - R

Time: 2018-11-07 12:09:52

Tags: r regex string rstudio

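I have 3 vectors. One contains the text, i.e. the actual words/sentences (text), one contains the words I want to search for (xreg), and the third (categories) contains the category each text should belong to if a match is found. Here are the 3 vectors: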

text <- c("Sole Service here", "Freedom to Include","Freedom to Incl","Premier Reg",
"Bankhall","Bankhall","Premier Regiona","St James Play",
"Premier Regional","Health online","Premier Regional",
"Tenet","Health on line","Tenet","Nations","Woolwich",
"Premier Regional","Lifesearch","Nations","Bankhall",
"Premier Regional","Sole Service her","Lifesearch",
"Premier Regional","Sole Service","Nations",
"Sole Service","First Money service","Sole Service",
"Nations wide","Sole Service","Premier Region")

text <- tolower(text)

xreg <- c("sole","freedom","premier","bankhall","james","health","tennet",
          "nations","woolwich","life","money")

categories <- c("SS", "FD", "PR", "BK", "JM", "HT", "TT", "NT", "WW", "LF", "MY")

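I want to search the "text" vector for the search terms in the "xreg" vector. Once a match is found, I want to assign those texts to the categories given in the "categories" vector.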

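For example, look for the word "sole" and record the category at the index where the match was found, or simply create a data frame with the text in one column and the category it belongs to in a separate column. If it is "sole", put it in the "SS" category; "freedom" goes in the "FD" category, and so on.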

Solution so far: I can search for each keyword one at a time, and it tells me the indices where a match was found.

reg_func <- function(x) {
  grep(x, text)
}
reg_func("sole")
reg_func("freedom")

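This gives me the indices for each matched word, which I can then use to update the categories. Is there a way to do this faster, instead of searching for one word at a time? Thanks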

2 answers:

Answer 0: (score: 1)

You can do it like this:

Data: (modified so that entry 1 has a double match and the last entry has no match)

text <- c("Sole Service here, premier", "Freedom to Include","Freedom to Incl","Premier Reg",
          "Bankhall","Bankhall","Premier Regiona","St James Play",
          "Premier Regional","Health online","Premier Regional",
          "Tenet","Health on line","Tenet","Nations","Woolwich",
          "Premier Regional","Lifesearch","Nations","Bankhall",
          "Premier Regional","Sole Service her","Lifesearch",
          "Premier Regional","Sole Service","Nations",
          "Sole Service","First Money service","Sole Service",
          "Nations wide","Sole Service","Premier Region", "no match in here!!!")

# text <- tolower(text)  # not needed; ignore.case = TRUE is used later

xreg <- c("sole","freedom","premier","bankhall","james","health","tennet",
          "nations","woolwich","life","money")

categories <- c("SS", "FD", "PR", "BK", "JM", "HT", "TT", "NT", "WW", "LF", "MY")

Code:

names(categories) <- xreg  # named vector: search term -> category

ans <- data.frame(text = I(text))  # data.frame that collects the results

# logical matrix (rows = texts, columns = search terms);
# for each row keep the search terms that matched
ans$xreg_m <- apply(
  sapply(xreg, function(x) grepl(x, text, ignore.case = TRUE)),
  1,
  function(x) xreg[x]
)

# no match returns character(0), which has length 0; replace those with NA
ans$xreg_m[!lengths(ans$xreg_m)] <- NA

# look up the category for each matched term
ans$categories_m <- sapply(ans$xreg_m, function(x) unique(unname(categories[x])))

Result:

#                         text        xreg_m categories_m
#1  Sole Service here, premier sole, premier       SS, PR
#2          Freedom to Include       freedom           FD
#3             Freedom to Incl       freedom           FD
#4                 Premier Reg       premier           PR
#5                    Bankhall      bankhall           BK
#6                    Bankhall      bankhall           BK
#7             Premier Regiona       premier           PR
#8               St James Play         james           JM
#9            Premier Regional       premier           PR
#10              Health online        health           HT
#11           Premier Regional       premier           PR
#12                      Tenet            NA           NA
#13             Health on line        health           HT
#14                      Tenet            NA           NA
#15                    Nations       nations           NT
#16                   Woolwich      woolwich           WW
#17           Premier Regional       premier           PR
#18                 Lifesearch          life           LF
#19                    Nations       nations           NT
#20                   Bankhall      bankhall           BK
#21           Premier Regional       premier           PR
#22           Sole Service her          sole           SS
#23                 Lifesearch          life           LF
#24           Premier Regional       premier           PR
#25               Sole Service          sole           SS
#26                    Nations       nations           NT
#27               Sole Service          sole           SS
#28        First Money service         money           MY
#29               Sole Service          sole           SS
#30               Nations wide       nations           NT
#31               Sole Service          sole           SS
#32             Premier Region       premier           PR
#33        no match in here!!!            NA           NA
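Rows 12 and 14 come back as NA because the search term is spelled "tennet" while the text contains "Tenet". The question does not say whether that spelling difference is intentional, so the sketch below only illustrates it. The pattern "ten+et" is a hypothetical alternative (not from the question) that accepts both spellings.

```r
# The search term from the question does not occur in the text
hit_original <- grepl("tennet", "Tenet", ignore.case = TRUE)  # FALSE

# Hypothetical pattern: "n+" matches one or more "n", so both spellings match
hit_single <- grepl("ten+et", "Tenet", ignore.case = TRUE)    # TRUE
hit_double <- grepl("ten+et", "Tennet", ignore.case = TRUE)   # TRUE

print(c(hit_original, hit_single, hit_double))
```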

Answer 1: (score: 0)

Explaining the functions used in @Andre Elrico's answer

apply(
  sapply(xreg, function(x) {grepl(x, text, ignore.case = TRUE)}), 1, function(x) xreg[x]
)

# Apply each xreg pattern to the text vector and check for a match.
# The result is a logical matrix: TRUE marks each text (row) that a pattern (column) matches
sapply(xreg, function(x) {grepl(x, text, ignore.case = TRUE)})

Result

      sole freedom premier bankhall james health tennet nations woolwich  life money
[1,]  TRUE   FALSE    TRUE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[2,] FALSE    TRUE   FALSE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[3,] FALSE    TRUE   FALSE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[4,] FALSE   FALSE    TRUE    FALSE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[5,] FALSE   FALSE   FALSE     TRUE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
[6,] FALSE   FALSE   FALSE     TRUE FALSE  FALSE  FALSE   FALSE    FALSE FALSE FALSE
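The shape of that matrix can be reproduced on a smaller input. This is a minimal sketch; `txt` and `pats` are hypothetical stand-ins for `text` and `xreg`, chosen only to keep the output short.

```r
# Hypothetical mini example, not the data from the question
txt  <- c("Sole Service", "Premier Region", "no match")
pats <- c("sole", "premier")

# sapply() calls grepl() once per pattern and binds the logical vectors
# into a matrix: one row per element of txt, one column per pattern
m <- sapply(pats, function(p) grepl(p, txt, ignore.case = TRUE))

print(dim(m))       # 3 2
print(colnames(m))  # "sole" "premier"
print(m)
```

Because `sapply()` uses the patterns as names, the columns are labelled with the search terms, which is what makes the matrix readable.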

# Now go through the matrix row by row and use each row's TRUE/FALSE values
# to subset xreg, which returns the search terms that matched that text
apply(
  sapply(xreg, function(x) {grepl(x, text, ignore.case = TRUE)}), 1, function(x) xreg[x]
)

Result

[[1]]
[1] "sole"    "premier"

[[2]]
[1] "freedom"

[[3]]
[1] "freedom"

[[4]]
[1] "premier"

[[5]]
[1] "bankhall"

[[6]]
[1] "bankhall"
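One thing to keep in mind about this step: `apply()` returns a list here only because the rows yield different numbers of matches. If every row matches exactly one pattern, `apply()` simplifies the result to a character vector. The sketch below shows that on hypothetical data (`txt`, `pats`), along with an `lapply()` variant that always returns a list; it is an alternative, not what the answer uses.

```r
# Hypothetical data where every text matches exactly one pattern
txt  <- c("Sole Service", "Premier Region")
pats <- c("sole", "premier")
m <- sapply(pats, function(p) grepl(p, txt, ignore.case = TRUE))

# Every row has exactly one TRUE, so apply() simplifies to a character vector
simplified <- apply(m, 1, function(hit) pats[hit])
print(is.list(simplified))  # FALSE

# lapply() over the row indices always returns a list, one element per row
as_list <- lapply(seq_len(nrow(m)), function(i) pats[m[i, ]])
print(is.list(as_list))     # TRUE
```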

Now get the category for each matched term (regex)

sapply(ans$xreg_m, function(x) unique(unname( categories[x] )))

which says:

# Take each element of xreg_m (our matched terms) and 
# see which element in the categories vector it matches with 
#  Then unname the result so you only get the category
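The lookup can be tried in isolation. This is a minimal sketch with a shortened, hypothetical two-element version of `xreg` and `categories`. It also shows what happens for unmatched rows, where `xreg_m` holds a logical `NA`: indexing with it returns one `NA` per element of `categories`, and `unique()` collapses those to a single `NA`.

```r
# Shortened, hypothetical versions of the vectors from the question
xreg <- c("sole", "premier")
categories <- c("SS", "PR")
names(categories) <- xreg  # named vector: search term -> category

# Indexing by name returns the categories, still carrying the names
with_names <- categories[c("sole", "premier")]

# unname() drops the names so only the category codes remain
lookup <- unname(with_names)
print(lookup)  # "SS" "PR"

# A logical NA index is recycled over the whole vector: one NA per element
na_lookup <- categories[NA]
print(length(na_lookup))  # 2

# unique() reduces that to a single NA
print(unique(unname(na_lookup)))  # NA
```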