I am working on a data set that I need to tokenize for training. Before tokenizing, I created a dictionary so that I can retrieve only those terms that are present in the dictionary.
My text file is as follows:
t <- "In order to perform operations inside the abdomen, surgeons must make an incision large enough to offer adequate visibility, provide access to the abdominal organs and allow the use of hand-held surgical instruments. These incisions may be placed in different parts of the abdominal wall. Depending on the size of the patient and the type of operation, the incision may be 6 to 12 inches in length. There is a significant amount of discomfort associated with these incisions that can prolong the time spent in the hospital after surgery and can limit how quickly a patient can resume normal daily activities. Because traditional techniques have long been used and taught to generations of surgeons, they are widely available and are considered the standard treatment to which newer techniques must be compared."
My dictionary contains:
dict <- c("hand-held surgical instruments", "intensive care unit", "traditional techniques")
Now I have applied bigram tokenization to the words in the document. For this, I used the following code:
#Preprocessing of data
library(tm)
library(RWeka)

corpus <- Corpus(VectorSource(t))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, PlainTextDocument)

#Bigram Tokenization
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer, dictionary = dict))
But I get this output:
<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            0
  intensive care unit                       0
  traditional techniques                    1
But I also need bigram tokens for the words that are not in the dictionary. Can anyone help me?
Answer 0 (score: 0)
You need to check what the dictionary option does: it only returns the terms that are in the dictionary. From the documentation:
dictionary: A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL which means that all terms in doc are listed.
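To see the default behaviour, here is a minimal sketch (reusing the corpus and BigramTokenizer from your question) that leaves dictionary at NULL, so every bigram in the document is tabulated:

# Without the dictionary argument, all bigrams in the document are listed.
tdm_all <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
inspect(tdm_all[1:5, ])  # first five bigrams, not just the dictionary terms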
What you can use is the code below. Note that removePunctuation also removes the hyphen in "hand-held". It is not needed anyway; the tokenizer drops most of the punctuation by itself.
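You can check the hyphen problem directly; removePunctuation is tm's own helper and works on a plain character vector:

removePunctuation("hand-held surgical instruments")
# [1] "handheld surgical instruments"
# The phrase no longer matches the dictionary entry "hand-held surgical instruments".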
EDIT: based on the comments
#Preprocessing of data
library(tm)
library(RWeka)

corpus <- Corpus(VectorSource(t))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, PlainTextDocument)

#Tokenizers
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Bigrams with the dictionary bigrams removed: passing the bigrams of the
# dictionary terms as stopwords filters them out of the result.
tdm_bigram_no_dict <- TermDocumentMatrix(corpus, control = list(stopwords = BigramTokenizer(dict), tokenize = BigramTokenizer))

# Dictionary bigrams found in the corpus
tdm_bigram_dict <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer, dictionary = dict))
inspect(tdm_bigram_dict)
<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            0
  intensive care unit                       0
  traditional techniques                    1
# Dictionary trigrams found in the corpus
tdm_trigram_dict <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer, dictionary = dict))
inspect(tdm_trigram_dict)
<<TermDocumentMatrix (terms: 3, documents: 1)>>
Non-/sparse entries: 1/2
Sparsity           : 67%
Maximal term length: 30
Weighting          : term frequency (tf)

                                Docs
Terms                            character(0)
  hand-held surgical instruments            1
  intensive care unit                       0
  traditional techniques                    0
# Combine the term-document matrices into one. You can use rbind since tdm's
# are sparse matrices. If you want extra speed, look into the slam package.
tdm_total <- rbind(tdm_bigram_no_dict, tdm_bigram_dict, tdm_trigram_dict)
Because of the rbind there will be duplicate rows for the dictionary terms. If you want to process the data further, you can transform the matrix into a data frame and use dplyr to group the duplicates into one row:
library(dplyr)

df <- data.frame(terms = rownames(as.matrix(tdm_total)),
                 freq = rowSums(as.matrix(tdm_total)),
                 row.names = NULL, stringsAsFactors = FALSE)
df <- df %>% group_by(terms) %>% summarise(freq = sum(freq))
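If the matrices grow large, the slam package mentioned in the comment above can do the grouping on the sparse matrix directly, without going through as.matrix. A hedged sketch, assuming rollup() accepts the row names as the grouping index (check ?slam::rollup):

library(slam)
# Sum duplicate rows of the sparse matrix without converting it to dense form.
# (Assumption: grouping by rownames; the result is a plain simple_triplet_matrix.)
tdm_grouped <- rollup(tdm_total, 1L, INDEX = rownames(tdm_total), FUN = sum)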