Using R to search for a specific text pattern and return the whole sentence in which the pattern occurs

Time: 2018-04-17 01:24:28

Tags: r text-mining

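I scanned a physical document, converted it to a TIFF image, and imported it into R with the tesseract package. Now I need R to look for a specific keyword, find it in the text file, and return the entire line that the keyword is on.

For example, if I have the text file:

This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”.

If I tell R to search for the keyword "straightforward", how do I get it to return "This is also straightforward...see if that matches"?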

2 answers:

Answer 0: (score: 0)

Here is one base R option:

text <- "This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”."
lst <- unlist(strsplit(text, "(?<=[a-z]\\.\\s)", perl=TRUE))
lst[grepl("\\bstraightforward\\b", lst)]

I am splitting your text on the pattern (?<=[a-z]\\.\\s), which says to look behind for a lowercase letter, followed by a full stop and a space. This should work well most of the time. There is the issue of abbreviations, but most of the time they take the form of a capital letter followed by a dot, and most of the time they do not end sentences.
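As a minimal sketch, the split-then-filter idea can be wrapped in a small reusable function. The name find_sentences and its arguments are hypothetical and not part of the original answer; the trimws() call is an added assumption that trailing whitespace on each piece is unwanted.

```r
# Hypothetical helper (not from the original answer): split text into
# sentences and keep those that contain `keyword` as a whole word.
find_sentences <- function(text, keyword) {
  # Split after a lowercase letter, a full stop, and one whitespace character
  sentences <- unlist(strsplit(text, "(?<=[a-z]\\.\\s)", perl = TRUE))
  # Each piece keeps the whitespace that followed its full stop; remove it
  sentences <- trimws(sentences)
  # \\b restricts the match to the whole word; `keyword` is treated as a
  # regular expression, so metacharacters in it are not escaped here
  pattern <- paste0("\\b", keyword, "\\b")
  sentences[grepl(pattern, sentences, perl = TRUE)]
}

text <- "This is also straightforward. Look at the years of experience required. It is important to note the rating."
result <- find_sentences(text, "straightforward")
print(result)
```

Because the split relies on a lowercase letter before the full stop, an abbreviation such as "Dr." followed by a space would also be treated as a sentence boundary, so the result should be checked on real OCR output.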


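Answer 1: (score: 0)

Here is a solution that uses the quanteda package to break the text into sentences, then uses grep() to return the sentence containing the word "straightforward".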

aText <- "This is also straightforward. Look at the years of experience required and see if that matches the years of experience that the candidate has. It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a “5”."
library(quanteda)
aCorpus <- corpus(aText)
theSentences <- tokens(aCorpus,what="sentence")
grep("straightforward",theSentences,value=TRUE)

...and the output:

> grep("straightforward",theSentences,value=TRUE)
                          text1 
"This is also straightforward." 

To search for multiple keywords, join them with the regex OR operator | in the pattern passed to grep().

grep("straightforward|exceeds",theSentences,value=TRUE)

...and the output:

> grep("straightforward|exceeds",theSentences,value=TRUE)
text1 
"This is also straightforward." 
<NA> 
"It is important to note that if the candidate matches or exceeds the years of experience required, you would rate both of those scenarios a \"5\"."
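When the keywords are held in a vector, the alternation pattern can be built rather than typed by hand. The following is a rough sketch in base R; the sentences vector is a stand-in assumed for illustration (a plain character vector rather than the quanteda object above), and the variable names are hypothetical.

```r
# Assumed stand-in for the tokenized sentences: a plain character vector
sentences <- c(
  "This is also straightforward.",
  "Look at the years of experience required.",
  "It is important to note that the candidate exceeds the requirement."
)

# Keywords to search for
keywords <- c("straightforward", "exceeds")

# Join the keywords with the regex OR operator: "straightforward|exceeds"
pattern <- paste(keywords, collapse = "|")

# value = TRUE returns the matching sentences instead of their indices;
# ignore.case = TRUE is an added choice so capitalized keywords also match
matches <- grep(pattern, sentences, value = TRUE, ignore.case = TRUE)
print(matches)
```

Keywords that contain regex metacharacters (for example a dot or a parenthesis) would need escaping before being joined this way.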