Filter a data frame of phrases based on uncommon words in a query

Time: 2018-03-22 22:32:44

Tags: r

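I want to filter the rows of a data frame based on a partial match between one or more words in a query phrase and the phrases in the cells of a data frame column. I tried the following as a start, to no avail. I realize that accomplishing what I am after will be quite challenging, so I am looking for strategies that would get me closer to this goal.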

library(stringr)
library(dplyr)

doc <- structure(list(LineNumber = structure(1:4, .Label = c("line 1", 
"line 2", "line 3", "line 4"), class = "factor"), Statement = structure(c(2L, 
1L, 4L, 3L), .Label = c("Harry and Larry went down the hill", 
"Jack and Jill went up the hill", "Jack fell down broke and broke his crown", 
"Tom climbed up the hill"), class = "factor")), .Names = c("LineNumber", 
"Statement"), class = "data.frame", row.names = c(NA, -4L))

query <- "went up hill"

doc %>% filter(str_detect(Statement, query))

## The answer expected was "Jack and Jill went up the hill"

1 Answer:

Answer 0 (score: 1)

You might be able to frame this as a text-mining problem, but as the commenters noted, the problem is very broad and not well specified (I believe the kind of query you want to make would be better served by a declarative programming language such as Prolog). Regardless, here is how you might set it up in tidytext and strip out all the common words:

> library(tidytext)
> doc.t <- as_tibble(doc)  # as.tibble() is deprecated in favor of as_tibble()
> doc.t$Statement <- as.character(doc.t$Statement)
> doc.t %>% unnest_tokens(word,Statement) %>% anti_join(stop_words)
Joining, by = "word"
# A tibble: 14 x 2
   LineNumber word   
   <fct>      <chr>  
 1 line 1     jack   
 2 line 1     jill   
 3 line 1     hill   
 4 line 2     harry  
 5 line 2     larry  
 6 line 2     hill   
 7 line 3     tom    
 8 line 3     climbed
 9 line 3     hill   
... the rest are truncated

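The problem with eliminating the most common words is that you also lose the ability to examine phrases like "went up the hill", where the sequence and ordering of the words matter. Here, for example, "went" and "up" are dropped as stop words, so only "hill" would be left of the query "went up hill".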
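
As a rough sketch of an alternative that keeps the common words, you could split the query into words and test each statement against every word, instead of matching the whole query as a single pattern. The helper `count_hits` and the `hits` column are names made up for this illustration, and the code assumes the query words contain no regex metacharacters. The last step shows one possible way to also require that the words appear in the same order as in the query.

```r
library(stringr)
library(dplyr)

# Same data as in the question, with Statement as character instead of factor
doc <- data.frame(
  LineNumber = paste("line", 1:4),
  Statement = c("Jack and Jill went up the hill",
                "Harry and Larry went down the hill",
                "Tom climbed up the hill",
                "Jack fell down broke and broke his crown"),
  stringsAsFactors = FALSE
)

query <- "went up hill"

# Split the query on whitespace into individual words
query_words <- str_split(query, "\\s+")[[1]]

# Hypothetical helper: number of query words found in one statement,
# matched as whole words and ignoring case
count_hits <- function(statement, words) {
  sum(vapply(words, function(w) {
    str_detect(statement, regex(paste0("\\b", w, "\\b"), ignore_case = TRUE))
  }, logical(1)))
}

scored <- doc %>%
  mutate(hits = vapply(Statement, count_hits, integer(1),
                       words = query_words, USE.NAMES = FALSE))

# Keep only the rows that contain every query word, in any order
result <- scored %>% filter(hits == length(query_words))
print(result)

# Optionally require the words in query order, allowing other words in between
ordered_pattern <- regex(
  paste0("\\b", paste(query_words, collapse = "\\b.*\\b"), "\\b"),
  ignore_case = TRUE
)
ordered_result <- doc %>% filter(str_detect(Statement, ordered_pattern))
print(ordered_result)
```

For this data, both filters keep only "Jack and Jill went up the hill". Relaxing the first filter to something like `hits >= 2`, or sorting with `arrange(desc(hits))`, would turn it into a ranking of partial matches rather than a strict filter.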