Using grepl in R to find which words from a list appear in a data frame column

Time: 2018-07-11 10:53:35

Tags: r data.table string-matching grepl

I have a data frame df:

df <- structure(list(page = c(12, 6, 9, 65),
text = structure(c(4L,2L, 1L, 3L), 
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6", 
"Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"), 
class = "factor")), .Names = c("page","text"), row.names = c(NA, -4L), class = "data.frame")

I also have a list of words:

wordlist <- c("Audi", "BMW", "extended", "engine", "replacement", "Volkswagen", "company", "Toyota","exchange", "brand")

I check whether the words from wordlist are present in text by using grepl on each word and unlisting the results:

library(data.table)
setDT(df)[, match := paste(wordlist[unlist(lapply(wordlist, function(x) grepl(x, text, ignore.case = T)))], collapse = ","), by = 1:nrow(df)]

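The problem is that I want to find only the exact words from wordlist that appear in the text column. grepl also returns partial matches; for example, AudiA6 in the text is matched by Audi from the word list. In addition, my data frame is large, and running grepl this way takes a long time. Please suggest another approach if possible. I want something like this: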

df <- structure(list(page = c(12, 6, 9, 65), 
text = structure(c(4L,2L, 1L, 3L), 
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6", 
 "Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"),
class = "factor"), match = c("exchange", "BMW,engine,replacement", 
"brand", "BMW,Volkswagen,company")), row.names = c(NA, -4L), 
class = c("data.table", "data.frame"))

1 Answer:

Answer 0: (score: 5)

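You can use str_extract_all from stringr after adding a word boundary (\\b) on each side of every word you want to extract, so that only exact matches are considered. You also need to collapse all the words with "|", which means "or":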

sapply(stringr::str_extract_all(df$text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")
# [1] "exchange"               "engine,replacement,BMW" "brand"                  "Volkswagen,company,BMW"
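If adding stringr as a dependency is not an option, the same word-boundary idea can probably be expressed in base R with gregexpr and regmatches. This is only a minimal sketch: the text column is re-created here as a plain character vector (not a factor), and the names pattern, hits and match_base are made up for illustration.

```r
# Sample data from the question; text is kept as plain character here.
text <- c("ToyotaCorolla is offering new car exchange offers",
          "Get 2 years engine replacement warranty on BMW X6",
          "I just bought a brand new AudiA6",
          "Volkswagen is the parent company of BMW")
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement",
              "Volkswagen", "company", "Toyota", "exchange", "brand")

# One alternation pattern with a word boundary on each side of every word.
pattern <- paste0("\\b", wordlist, "\\b", collapse = "|")

# gregexpr() locates every match per element; regmatches() extracts them.
hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))

# Collapse the matches of each row into one comma-separated string.
match_base <- vapply(hits, paste, character(1), collapse = ",")
match_base
```

As with str_extract_all, the matches come back in the order they occur in the text, and a row without any match yields an empty string.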

If you want to put it in the data.table:

df[, match:=sapply(stringr::str_extract_all(text, paste("\\b", wordlist, "\\b", sep="", collapse="|")), paste, collapse=",")]
df
#   page                                              text                  match
#1:   12 ToyotaCorolla is offering new car exchange offers               exchange
#2:    6 Get 2 years engine replacement warranty on BMW X6 engine,replacement,BMW
#3:    9                  I just bought a brand new AudiA6                  brand
#4:   65           Volkswagen is the parent company of BMW Volkswagen,company,BMW
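Note that the matches above appear in the order they occur in the text, whereas the desired output in the question lists them in wordlist order (for example "BMW,engine,replacement"). If that ordering matters, one possible post-processing step is sketched below. The helper order_by_wordlist is hypothetical, not part of stringr or data.table, and the matches vector simply repeats the output shown above.

```r
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement",
              "Volkswagen", "company", "Toyota", "exchange", "brand")

# Match strings as produced by the str_extract_all() call above.
matches <- c("exchange", "engine,replacement,BMW", "brand", "Volkswagen,company,BMW")

# Hypothetical helper: reorder one comma-separated match string so that
# its words follow the order of `words`, dropping duplicates on the way.
order_by_wordlist <- function(m, words) {
  found <- unique(strsplit(m, ",", fixed = TRUE)[[1]])
  paste(words[words %in% found], collapse = ",")
}

ordered <- vapply(matches, order_by_wordlist, character(1),
                  words = wordlist, USE.NAMES = FALSE)
ordered
```

Assuming df already holds the match column, the same helper could be applied with df[, match := vapply(match, order_by_wordlist, character(1), words = wordlist, USE.NAMES = FALSE)].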