Question

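I am using the "tm" package to create a DocumentTermMatrix in R. It works for unigrams, but I am trying to create a DocumentTermMatrix of N-grams (N = 3 for now) using the tm package and the tokenize_ngrams function from the "tokenizers" package. However, I am unable to create it.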

I searched for possible solutions but did not find much help. For privacy reasons I cannot share the data. Here is what I have tried:

library(tm)  
library(tokenizers)

data is a data frame with about 4.5k rows and 2 columns, namely "doc_id" and "text".

data_corpus = Corpus(DataframeSource(data))

Custom function for n-gram tokenization:

ngram_tokenizer = function(x){
  temp = tokenize_ngrams(x, n_min = 1, n = 3, stopwords = FALSE, ngram_delim = "_")
  return(temp)
}

Control lists for DTM creation:
For unigrams

control_list_unigram = list(tokenize = "words",
                          removePunctuation = FALSE,
                          removeNumbers = FALSE, 
                          stopwords = stopwords("english"), 
                          tolower = T, 
                          stemming = T, 
                          weighting = function(x)
                            weightTf(x)
)

For N-gram tokenization

control_list_ngram = list(tokenize = ngram_tokenizer,
                    removePunctuation = FALSE,
                    removeNumbers = FALSE, 
                    stopwords = stopwords("english"), 
                    tolower = T, 
                    stemming = T, 
                    weighting = function(x)
                      weightTf(x)
                    )


dtm_unigram = DocumentTermMatrix(data_corpus, control_list_unigram)
dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)

dim(dtm_unigram)
dim(dtm_ngram)

Both DTMs have the same dimensions.
Please correct me!

Answer 1

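Unfortunately, tm has some annoying quirks, and they are not always obvious. First of all, custom tokenization does not seem to work on corpora created with Corpus. You need to use VCorpus for this.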

So change the line data_corpus = Corpus(DataframeSource(data)) to data_corpus = VCorpus(DataframeSource(data)).

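That takes care of one problem. The corpus will now work with tokenization, but you will then run into an issue with tokenize_ngrams. You will get the following error: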

Input must be a character vector of any length or a list of character
  vectors, each of which has a length of 1.

when running this line: dtm_ngram = DocumentTermMatrix(data_corpus, control_list_ngram)

To solve this without depending on the tokenizers package, you can use the following function to tokenize the data.

NLP_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
}

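This uses the ngrams function from the NLP package, which is loaded when you load the tm package. The 1:3 tells it to create n-grams of 1 to 3 words. So your control_list_ngram should look like this: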

control_list_ngram = list(tokenize = NLP_tokenizer,
                          removePunctuation = FALSE,
                          removeNumbers = FALSE, 
                          stopwords = stopwords("english"), 
                          tolower = T, 
                          stemming = T, 
                          weighting = function(x)
                            weightTf(x)
                          )
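
To check that the tokenizer is actually being applied, here is a minimal sketch on made-up toy data (the `toy` data frame and its two sentences are assumptions for illustration, not the asker's data). It uses a VCorpus and compares the number of columns of the unigram and n-gram matrices; the controls are stripped down to just the tokenizer to keep the comparison simple.

```r
library(tm)  # loading tm also makes ngrams() and words() from NLP available

# Same tokenizer as above: 1- to 3-word n-grams joined with "_"
NLP_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:3), paste, collapse = "_"), use.names = FALSE)
}

# Hypothetical toy data with the same two columns as in the question
toy <- data.frame(doc_id = c("1", "2"),
                  text = c("the quick brown fox", "the lazy dog sleeps"),
                  stringsAsFactors = FALSE)

# VCorpus instead of Corpus, so that the custom tokenizer is honoured
toy_corpus <- VCorpus(DataframeSource(toy))

# Inspect what the tokenizer returns for the first document
print(NLP_tokenizer(toy_corpus[[1]]))

dtm_uni <- DocumentTermMatrix(toy_corpus, control = list(tokenize = "words"))
dtm_ngr <- DocumentTermMatrix(toy_corpus, control = list(tokenize = NLP_tokenizer))

# The n-gram matrix should now have more columns than the unigram one
print(dim(dtm_uni))
print(dim(dtm_ngr))
```

If the two dimensions still come out identical on your own data, the corpus is most likely still being built with Corpus rather than VCorpus.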

Personally, I would use the quanteda package for all of this. But for now this should help you.
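
For comparison, a rough sketch of the quanteda route might look like the following. It assumes a reasonably recent quanteda version and again uses a made-up `toy` data frame; stopword removal and stemming are left out here and would be separate `tokens_*` steps (for example `tokens_remove` and `tokens_wordstem`), so the result is not a drop-in equivalent of the tm control list above.

```r
library(quanteda)

# Hypothetical toy data with the same two columns as in the question
toy <- data.frame(doc_id = c("1", "2"),
                  text = c("the quick brown fox", "the lazy dog sleeps"),
                  stringsAsFactors = FALSE)

# Build a corpus from the data frame, naming the id and text columns explicitly
corp <- corpus(toy, docid_field = "doc_id", text_field = "text")

# Tokenize into words, then form 1- to 3-word n-grams joined with "_"
toks <- tokens(corp)
toks <- tokens_ngrams(toks, n = 1:3, concatenator = "_")

# Document-feature matrix (quanteda's counterpart to a DocumentTermMatrix)
dfm_ngram <- dfm(toks)
print(dim(dfm_ngram))
```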

Creating a Document Term Matrix with N-grams in R