How do I search a document for dictionary words in R?

Time: 2015-09-03 03:48:27

Tags: r dictionary

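I have created a dictionary of terms. Now I need to check whether the terms in the dictionary occur in a document. A sample of the document is shown below: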

Laparoscopic surgery, also called minimally invasive surgery (MIS), bandaid surgery, or keyhole surgery, is a modern surgical technique in which operations are performed far from their location through small incisions (usually 0.5–1.5 cm) elsewhere in the body.

There are a number of advantages to the patient with laparoscopic surgery versus the more common, open procedure. Pain and hemorrhaging are reduced due to smaller incisions and recovery times are shorter. The key element in laparoscopic surgery is the use of a laparoscope, a long fiber optic cable system which allows viewing of the affected area by snaking the cable from a more distant, but more easily accessible location.

From this document, I split each paragraph into sentences, as follows:

[1] "Laparoscopic surgery, also called minimally invasive surgery (MIS), bandaid surgery, or keyhole surgery, is a modern surgical technique in which operations are performed far from their location through small incisions (usually 0.5–1.5 cm) elsewhere in the body."
[2] "There are a number of advantages to the patient with laparoscopic surgery versus the more common, open procedure."                                                          
[3] "Pain and hemorrhaging are reduced due to smaller incisions and recovery times are shorter."                                                                    
[4] "The key element in laparoscopic surgery is the use of a laparoscope, a long fiber optic cable system which allows viewing of the affected area by snaking the cable from a more distant, but more easily accessible location."

The dictionary contains the following terms:

Laparoscopic surgery
minimally invasive surgery
bandaid surgery
keyhole surgery
surgical technique
small incisions
fiber optic cable system

Now I want to search each sentence for all of the dictionary terms using R. The code I have written so far is shown below.

c <- "Laparoscopic surgery, also called minimally invasive surgery (MIS), bandaid surgery, or keyhole surgery, is a modern surgical technique in which operations are performed far from their location through small incisions (usually 0.5–1.5 cm) elsewhere in the body.

   There are a number of advantages to the patient with laparoscopic surgery versus the more common, open procedure. Pain and hemorrhaging are reduced due to smaller incisions and recovery times are shorter. The key element in laparoscopic surgery is the use of a laparoscope, a long fiber optic cable system which allows viewing of the affected area by snaking the cable from a more distant, but more easily accessible location."

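library(tm)
library(openNLP)

# Split a block of text into sentences with the openNLP sentence detector.
# as.String() and annotate() come from the NLP package.
convert_text_to_sentences <- function(text, lang = "en") {
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)
  text <- as.String(text)
  # Character spans of each detected sentence
  sentence.boundaries <- annotate(text, sentence_token_annotator)
  # Subsetting the String by those spans extracts the sentence texts
  sentences <- text[sentence.boundaries]
  return(sentences)
}

q <- convert_text_to_sentences(c)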

1 answer:

Answer 0: (score: 1)

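Assuming q is a character vector (or list) of sentences, dict is a character vector of the dictionary terms, and you are only interested in exact matches of the terms, you can use regular expressions: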

matches <- lapply(q, function(x) dict[sapply(dict, grepl, x, ignore.case = TRUE)])

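You will get a list with the same length as q. Each list element contains a vector of the dictionary terms found in the corresponding sentence.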
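
To make this concrete, here is a minimal base-R sketch of the same idea applied to the four sentences from the question. It rests on a few assumptions that are not in the original post: the sentences are typed in directly (so openNLP is not needed to try it), dict is stored as a plain character vector, and the helper name hits is made up for illustration. It uses vapply() in place of sapply() only to pin down the return type; the matching logic is unchanged.

```r
# Sentences from the question, entered by hand (assumption: this stands in
# for the output of convert_text_to_sentences()).
q <- c(
  "Laparoscopic surgery, also called minimally invasive surgery (MIS), bandaid surgery, or keyhole surgery, is a modern surgical technique in which operations are performed far from their location through small incisions (usually 0.5-1.5 cm) elsewhere in the body.",
  "There are a number of advantages to the patient with laparoscopic surgery versus the more common, open procedure.",
  "Pain and hemorrhaging are reduced due to smaller incisions and recovery times are shorter.",
  "The key element in laparoscopic surgery is the use of a laparoscope, a long fiber optic cable system which allows viewing of the affected area by snaking the cable from a more distant, but more easily accessible location."
)

# Assumed representation of the dictionary: a character vector of terms.
dict <- c("Laparoscopic surgery", "minimally invasive surgery",
          "bandaid surgery", "keyhole surgery", "surgical technique",
          "small incisions", "fiber optic cable system")

# For each sentence, keep the dictionary terms that occur in it.
# Each term is used as a regular expression, matched case-insensitively.
matches <- lapply(q, function(x) {
  found <- vapply(dict, function(term) grepl(term, x, ignore.case = TRUE),
                  logical(1))
  dict[found]
})

# Alternative view (hypothetical helper name): a logical matrix with one
# row per sentence and one column per dictionary term.
hits <- vapply(dict, function(term) grepl(term, q, ignore.case = TRUE),
               logical(length(q)))

lengths(matches)    # number of terms found per sentence: 6 1 0 2
colSums(hits) > 0   # which terms occur anywhere in the document
```

Note that the third sentence yields character(0): "small incisions" does not match "smaller incisions", because grepl() looks for the literal character sequence. Since each term is treated as a regular expression, a term containing characters such as ( or . would need escaping first; fixed = TRUE avoids that but cannot be combined with ignore.case = TRUE.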