How can I create bigrams in R using a dictionary?

Asked: 2015-08-31 05:21:41

Tags: r dictionary

I have a dictionary of terms stored in a file, dictionary.txt. It contains trigrams and bigrams. Now I am given the following paragraph:

"In order to perform operations inside the abdomen, surgeons must make an incision large enough to offer adequate visibility, provide access to the abdominal organs and allow the use of hand-held surgical instruments.  These incisions may be placed in different parts of the abdominal wall.  Depending on the size of the patient and the type of operation, the incision may be 6 to 12 inches in length.  There is a significant amount of discomfort associated with these incisions that can prolong the time spent in the hospital after surgery and can limit how quickly a patient can resume normal daily activities.  Because traditional techniques have long been used and taught to generations of surgeons, they are widely available and are considered the standard treatment to which newer techniques must be compared."

The dictionary.txt file contains the following terms:

hand-held surgical instruments
intensive care unit
traditional techniques

Now I want to create bigrams for the words that are not in dictionary.txt.

I used the following code in R:

library(RWeka)  # provides NGramTokenizer and Weka_control

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

Can anyone help me with the R code for this?

1 Answer:

Answer 0: (score: 0)

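Based on your text and dictionary, I created bigrams for both, then removed from the text's bigrams any that also appear among the dictionary's bigrams.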

t <- "In order to perform operations inside the abdomen, surgeons must make an incision large enough to offer adequate visibility, provide access to the abdominal organs and allow the use of hand-held surgical instruments.  These incisions may be placed in different parts of the abdominal wall.  Depending on the size of the patient and the type of operation, the incision may be 6 to 12 inches in length.  There is a significant amount of discomfort associated with these incisions that can prolong the time spent in the hospital after surgery and can limit how quickly a patient can resume normal daily activities.  Because traditional techniques have long been used and taught to generations of surgeons, they are widely available and are considered the standard treatment to which newer techniques must be compared."


dictionary <- c("hand-held surgical instruments", "intensive care unit", "traditional techniques")

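# BigramTokenizer is the function defined in the question (requires RWeka)
bigrams_dict <- BigramTokenizer(dictionary)
bigrams_text <- BigramTokenizer(t)

# Keep only the text bigrams that do not occur among the dictionary bigrams
bigrams_text[!bigrams_text %in% bigrams_dict]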
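
If installing RWeka (which depends on Java) is not an option, the same idea can be sketched in base R. This is only an illustration under stated assumptions: make_bigrams is a hypothetical helper, not part of RWeka, and its delimiter set only approximates the tokenizer's behaviour. The sample text is a shortened stand-in for the paragraph above.

```r
# Hypothetical helper (not from RWeka): builds bigrams with base R only.
# Assumption: tokens are separated by whitespace and common punctuation;
# hyphens are kept, so "hand-held" stays a single token.
make_bigrams <- function(x) {
  unlist(lapply(x, function(s) {
    words <- unlist(strsplit(s, "[[:space:].,;:!?()\"]+"))
    words <- words[nzchar(words)]            # drop empty tokens
    if (length(words) < 2) return(character(0))
    paste(head(words, -1), tail(words, -1))  # pair each word with its successor
  }))
}

# Shortened sample text, for illustration only
text <- "surgeons use hand-held surgical instruments. Because traditional techniques have long been used"
dictionary <- c("hand-held surgical instruments", "intensive care unit", "traditional techniques")

bigrams_dict <- make_bigrams(dictionary)
bigrams_text <- make_bigrams(text)

# Keep only the text bigrams that do not occur among the dictionary bigrams
result <- bigrams_text[!bigrams_text %in% bigrams_dict]
print(result)
```

Note that, in this sketch, bigrams that straddle the edge of a dictionary term (such as "use hand-held") or a sentence boundary (such as "instruments Because") are still returned, because only exact matches against the dictionary's own bigrams are removed.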