Question

I am using the R text2vec package to create a document-term matrix. Here is my code:

library(lime)
library(text2vec)
library(magrittr)  # provides %>%

# load data
data(train_sentences, package = "lime")

# tokenize the sentences
tokens <- train_sentences$text %>%
  word_tokenizer()

it <- itoken(tokens, progressbar = FALSE)

stop_words <- c("in", "the", "a", "at", "for", "is", "am") # stop words
vocab <- create_vocabulary(it, ngram = c(1L, 2L), stopwords = stop_words) %>%
  prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
vectorizer <- vocab_vectorizer(vocab)

dtm <- create_dtm(it, vectorizer, type = "dgTMatrix")

An alternative is to use hash_vectorizer() instead of vocab_vectorizer():

h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 2L))
dtm <- create_dtm(it, h_vectorizer)

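However, when I use hash_vectorizer there is no option for removing stop words or pruning the vocabulary. In my case study, hash_vectorizer works better for me than vocab_vectorizer. I know that stop words can be removed after the DTM has been created, or even when the tokens are created. Are there any other options, similar to vocab_vectorizer and the way it is created? I am particularly interested in an approach that also supports vocabulary pruning similar to prune_vocabulary().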

Thanks in advance for any responses. Sam

Answer 1

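This is not possible. The whole point of using hash_vectorizer and feature hashing is to avoid hash map lookups (getting the index of a given word). Removing stop words is essentially the same kind of operation: checking whether a word is in the set of stop words. hash_vectorizer is usually recommended only when the dataset is very large and building a vocabulary would take a lot of time or memory. Otherwise, in my experience, vocab_vectorizer with prune_vocabulary will perform at least as well.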
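The workaround the question already mentions, dropping stop words at the token stage before hashing, might look roughly like the sketch below. The two toy sentences and the `lapply` filter are illustrative assumptions, not something hash_vectorizer provides itself; the set-membership check simply happens up front in plain R instead of inside the vectorizer.

```r
library(text2vec)

# Toy documents (made up for illustration)
docs <- c("the cat sat in the hat", "a dog is at the door")
stop_words <- c("in", "the", "a", "at", "for", "is", "am")

# Tokenize, then filter stop words out of every token vector
tokens <- word_tokenizer(tolower(docs))
tokens <- lapply(tokens, function(x) x[!x %in% stop_words])

# Hash the already-filtered tokens; bigrams are formed over what remains
it <- itoken(tokens, progressbar = FALSE)
h_vectorizer <- hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 2L))
dtm <- create_dtm(it, h_vectorizer)

dim(dtm)
```

Note that because the stop words are removed before n-grams are built, bigrams such as "cat_sat" span the positions where stop words used to be.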

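In addition, if you use hash_vectorizer with a small hash_size, it acts as a dimensionality reduction step and can therefore reduce the variance of your dataset. So if your dataset is not very large, I suggest using vocab_vectorizer and tuning the prune_vocabulary parameters to reduce the size of the vocabulary and of the document-term matrix.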
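If hashing is still required, a rough approximation of prune_vocabulary() is to filter the columns of the hashed matrix after it has been built. The sketch below uses only the Matrix package on a small hand-made sparse matrix standing in for a hashed DTM; the thresholds are arbitrary. This is an assumption-laden stand-in rather than an equivalent: each column is a hash bucket that may hold several colliding terms, so dropping a column drops all of them.

```r
library(Matrix)

# Hand-made 4 x 6 sparse matrix standing in for a hashed DTM
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 3, 4, 4, 4),
  j = c(1, 2, 1, 3, 1, 2, 1, 4, 5),
  x = c(1, 3, 1, 1, 2, 1, 1, 1, 2),
  dims = c(4, 6)
)

# Per-column statistics analogous to term_count and doc_proportion
term_count <- colSums(dtm)
doc_proportion <- colSums(dtm > 0) / nrow(dtm)

# Keep buckets that are frequent enough but not present in too many documents
keep <- term_count >= 2 & doc_proportion <= 0.5
dtm_pruned <- dtm[, keep, drop = FALSE]

dim(dtm_pruned)
```

Stop words cannot be targeted this way, since the bucket a given word falls into is not stored anywhere; only frequency-based pruning carries over.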
