Question

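I have multiple rows of text data (different documents), and each row contains about 60-70 lines of text (more than 50,000 characters). Within each row, only 1-2 lines are of interest to me, and I identify them by keywords. I want to extract only the sentences in which a keyword or group of words is present. My hypothesis is that by extracting only that piece of information I can get better POS tagging and a better understanding of the sentence context, because I am looking only at the sentences I need. Is my understanding correct, and how can we achieve this in R other than by using regular expressions and full stops? This may be computationally intensive.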

For example: The Boy lives in Miami and studies in the st. Martin School. The boy has a height of 5.7" and weight 60 Kg's. He has interest in the Arts and crafts; and plays basketball...

I want to extract only the sentence "The Boy lives in Miami and studies in the st. Martin School" based on the keyword study (stemmed keyword).

Answer 1

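For this example I used three packages: NLP and openNLP (for sentence splitting) and SnowballC (for stemming). I did not use the tokenizers package mentioned in the other answer because I was not familiar with it. The openNLP package is an interface to the Apache OpenNLP toolkit, which is well known in the community.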

First, install the packages with the code below. If they are already installed, skip to the next step:

## List of used packages 
list.of.packages <- c("NLP", "openNLP", "SnowballC")

## Packages from the list that are not installed yet
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]

## Installs new packages
if(length(new.packages)) 
  install.packages(new.packages)

Next, load the packages:

library(NLP)
library(openNLP)
library(SnowballC)

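Next, convert the text to a String (using as.String from the NLP package). This is necessary because the openNLP package works with the String type. In this example I used the same text you provided in the question: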

example_text <- paste0("The Boy lives in Miami and studies in the St. Martin School. ",
                       "The boy has a heiht of 5.7 and weights 60 Kg's. ", 
                       "He has intrest in the Arts and crafts; and plays basketball. ")

example_text <- as.String(example_text)

#output
> example_text
The Boy lives in Miami and studies in the St. Martin School. The boy has a heiht of 5.7 and weights 60 Kg's. He has intrest in the Arts and crafts; and plays basketball.

Next, we use the openNLP package to create a sentence annotator, which computes the annotations through a sentence detector:

sent_annotator <- Maxent_Sent_Token_Annotator()
annotation <- annotate(example_text, sent_annotator)

Next, using the annotations on the text, we can extract the sentences:

splited_text <- example_text[annotation]
#output
splited_text
[1] "The Boy lives in Miami and studies in the St. Martin School." 
[2] "The boy has a heiht of 5.7 and weights 60 Kg's. "             
[3] "He has intrest in the Arts and crafts; and plays basketball. "

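Finally, we use the wordStem function from the SnowballC package, which supports English. This function reduces a word, or a vector of words, to its stem (a common base form). We then use the grep function from base R to find the sentences that contain the keyword we are looking for: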

stemmed_keyword <- wordStem("study", language = "english")
sentence_index <- grep(stemmed_keyword, splited_text)
#output
splited_text[sentence_index]
[1] "The Boy lives in Miami and studies in the St. Martin School."
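One limitation of the grep step is that it does a plain substring match on the stem, so it is case-sensitive and would miss a sentence that contains the unstemmed form "study". A minimal sketch of a stricter variant follows, which stems every word of each sentence and compares stems. The helper name has_stemmed_keyword and the crude regex word splitting are my own assumptions for illustration, not part of any of the packages above.

```r
library(SnowballC)

# Hypothetical helper (not from any package): returns TRUE for each sentence
# that contains a word whose stem equals the stem of `keyword`.
has_stemmed_keyword <- function(sentences, keyword, language = "english") {
  target <- wordStem(tolower(keyword), language = language)
  vapply(as.character(sentences), function(s) {
    # Crude word tokenisation: split on anything that is not a letter or apostrophe
    words <- unlist(strsplit(tolower(s), "[^a-z']+"))
    words <- words[nzchar(words)]
    target %in% wordStem(words, language = language)
  }, logical(1), USE.NAMES = FALSE)
}

sentences <- c(
  "The Boy lives in Miami and studies in the St. Martin School.",
  "The boy has a heiht of 5.7 and weights 60 Kg's.",
  "He has intrest in the Arts and crafts; and plays basketball."
)

# Any inflected form of the keyword can be passed in
sentences[has_stemmed_keyword(sentences, "study")]
sentences[has_stemmed_keyword(sentences, "studying")]
```

With the example sentences, both calls should select only the first sentence, because "study", "studying" and "studies" share the same Snowball stem.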

Note

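Note that I changed the example text you provided from **"... st. Martin School."** to **"... St. Martin School."**. If the letter "s" is kept lowercase, the sentence detector treats the period in "st." as the end of a sentence. The vector of split sentences then looks like this: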

> splited_text
[1] "The Boy lives in Miami and studies in the st."
[2] "Martin School."
[3] "The boy has a heiht of 5.7 and weights 60 Kg's."
[4] "He has intrest in the Arts and crafts; and plays basketball."

As a result, checking this vector for the keyword gives the following output:

> splited_text[sentence_index]
[1] "The Boy lives in Miami and studies in the st."

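I also tested the tokenizers package mentioned in the other answer and ran into the same problem. So be aware that this is an open problem in NLP annotation tasks. The logic and algorithm above nevertheless work correctly.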
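The manual change from "st." to "St." described above can be automated for abbreviations you know in advance. The following is only a sketch in base R: the helper name normalize_abbreviations and the abbreviation list are hypothetical, and it does nothing for abbreviations that are not in the list.

```r
# Hypothetical helper: capitalise known abbreviations before sentence
# detection, mirroring the manual "st." -> "St." change described above.
# The default `abbreviations` vector is illustrative, not exhaustive.
normalize_abbreviations <- function(text,
                                    abbreviations = c("st", "dr", "mr", "mrs")) {
  for (abbr in abbreviations) {
    # \\b keeps the pattern from matching inside longer words such as "first."
    pattern <- paste0("\\b", abbr, "\\.")
    replacement <- paste0(toupper(substring(abbr, 1, 1)), substring(abbr, 2), ".")
    text <- gsub(pattern, replacement, text, perl = TRUE)
  }
  text
}

normalize_abbreviations(
  "The Boy lives in Miami and studies in the st. Martin School."
)
```

The result can then be passed to as.String and annotated as before. This is a workaround for the cases you list, not a general fix for abbreviation handling.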

I hope this helps.

Answer 2

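For each document, you can first apply SnowballC::wordStem to stem the words, and then use tokenizers::tokenize_sentences to split the document into sentences. You can then use grepl to find the sentences that contain the keywords you are looking for.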
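A rough sketch of this idea for several documents at once is shown below. Note that it deviates from the order described above: it splits into sentences first and then stems the words of each sentence with tokenizers::tokenize_word_stems, because wordStem expects individual words rather than a whole document. The helper name extract_sentences and the sample documents are made up for illustration.

```r
library(tokenizers)
library(SnowballC)

# Hypothetical helper: for each document, return the sentences that contain
# a word sharing its stem with `keyword`.
extract_sentences <- function(documents, keyword) {
  target <- wordStem(tolower(keyword), language = "english")
  # tokenize_sentences returns a list with one character vector per document
  lapply(tokenize_sentences(documents), function(sentences) {
    # One vector of lower-cased, stemmed words per sentence
    stems <- tokenize_word_stems(sentences, language = "english")
    keep <- vapply(stems, function(s) target %in% s, logical(1))
    sentences[keep]
  })
}

# Made-up sample documents, one element per document
documents <- c(
  "The boy lives in Miami. He studies at a school nearby. He plays basketball.",
  "She is studying art. She also likes crafts."
)

extract_sentences(documents, "study")
```

The result is a list with one element per document, so it lines up with the rows of the original data. The sample text deliberately avoids abbreviations such as "St.", since sentence splitting around them is unreliable.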

NLP: extracting only specific sentences from a whole text in R
