Using R, how can I test a data frame of phrases to see whether each one contains a keyword?

Time: 2015-09-29 21:48:11

Tags: regex r grep dataframe

I have two data frames. One contains search phrases like this:

   search.phrases
1  the quick
2  brown fox jumps
3  over the lazy 
5  dog
6  why
7  nobody knows
   ...

and the other contains keywords:

   keywords
1  quick
2  lazy 
3  dog
4  knows
   ...

Ideally, I would like to find which search phrases contain one or more of the keywords (as a boolean or a count), like this:

   search.phrases      keyword.found     
1  the quick           TRUE
2  brown fox jumps     FALSE      
3  over the lazy       TRUE
5  dog                 TRUE
6  why                 FALSE
7  nobody knows        TRUE
   ...

I have been trying for a while, but I am stuck. Any help is much appreciated.

Much love,

G

2 answers:

Answer 0 (score: 3)

You can use grepl() with a single regular expression that joins all the keywords with "|":

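# Join the keywords into one alternation pattern: "quick|lazy|dog|knows"
rgx <- paste(as.character(df2$keywords), collapse = "|")
# TRUE for every phrase that matches at least one keyword
df$keyword.found <- grepl(rgx, df$search.phrases)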

Result:

   search.phrases keyword.found
1       the quick          TRUE
2 brown fox jumps         FALSE
3   over the lazy          TRUE
5             dog          TRUE
6             why         FALSE
7    nobody knows          TRUE
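
One caveat with the pattern above: it matches substrings, so a phrase such as "hotdog stand" would count as containing "dog". If whole-word matching is what you want, a minimal sketch is to wrap the alternation in \b word boundaries. The extra phrase "hotdog stand" and the use of perl = TRUE are my own additions for illustration, and the sketch assumes the keywords contain no regex metacharacters (they would need escaping otherwise).

```r
# Illustrative data mirroring the question, plus one hypothetical phrase
# ("hotdog stand") to show the difference from plain substring matching.
df <- data.frame(
  search.phrases = c("the quick", "brown fox jumps", "over the lazy",
                     "dog", "why", "nobody knows", "hotdog stand"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(keywords = c("quick", "lazy", "dog", "knows"),
                  stringsAsFactors = FALSE)

# Build "\\b(quick|lazy|dog|knows)\\b" so only whole words match.
rgx <- paste0("\\b(", paste(df2$keywords, collapse = "|"), ")\\b")

# perl = TRUE selects PCRE, where \b is a word boundary.
df$keyword.found <- grepl(rgx, df$search.phrases, perl = TRUE)
print(df)
```

With this pattern "hotdog stand" comes out FALSE, whereas the plain "quick|lazy|dog|knows" pattern would flag it as TRUE.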

Data:

df2 <- structure(list(keywords = structure(c(4L, 3L, 1L, 2L), .Label = c("dog", 
"knows", "lazy", "quick"), class = "factor")), .Names = "keywords", class = "data.frame", row.names = c("1", 
"2", "3", "4"))
df <- structure(list(search.phrases = structure(c(5L, 1L, 4L, 2L, 6L, 
3L), .Label = c("brown fox jumps", "dog", "nobody knows", "over the lazy", 
"the quick", "why"), class = "factor")), .Names = "search.phrases", class = "data.frame", row.names = c("1", 
"2", "3", "5", "6", "7"))

Answer 1 (score: 1)

library(stringr)

phrases <- c("the quick fox", "had a dog", "named bruce")
keywords <- c("quick", "bruce")

# Split each phrase into its words (one character vector per phrase)
phrase_list <- str_split(phrases, " ")
# TRUE where any word of the phrase is one of the keywords
z <- sapply(phrase_list, function(x) any(x %in% keywords))
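
The question also asks for a count rather than just a boolean. As a rough sketch of how the same split-and-match idea could be extended, the code below counts how many distinct keywords occur in each phrase. It uses base R strsplit() instead of stringr, and the fourth phrase and the names keyword.count and keyword.found are my own additions for illustration.

```r
# Illustrative inputs; the last phrase is hypothetical and contains
# both keywords, one of them twice.
phrases <- c("the quick fox", "had a dog", "named bruce", "quick quick bruce")
keywords <- c("quick", "bruce")

# Split each phrase on single spaces; fixed = TRUE treats " " literally.
phrase_list <- strsplit(phrases, " ", fixed = TRUE)

# For each phrase, count how many distinct keywords appear among its words.
keyword.count <- vapply(
  phrase_list,
  function(words) sum(keywords %in% words),
  integer(1)
)

# The boolean version follows directly from the count.
keyword.found <- keyword.count > 0

print(data.frame(phrases, keyword.count, keyword.found))
```

Because the test is keywords %in% words, a keyword repeated inside one phrase is counted once. Swapping it to sum(words %in% keywords) would count every occurrence instead.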