Stem completion in R replaces names, not data

Date: 2018-04-04 22:27:00

Tags: r tm topic-modeling quanteda

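My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the quanteda package in R. I'd like to reduce words to their word stems before the topic modeling step, so that variations of the same word are not counted as different terms.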

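The only problem is that the stemming algorithm leaves behind some stems that aren't real words. "happiness" stems to "happi", "arrange" stems to "arrang", and so on. So, before I visualize the results of the topic modeling, I'd like to restore the stems to complete words.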

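Reading through some previous threads here on StackOverflow, I came across a function from the tm package, stemCompletion(), that does this, at least approximately. It seems to work reasonably well.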

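But when I apply it to the terms vector of a document-term matrix, stemCompletion() always replaces the names of the character vector, not the character values themselves. Here is a reproducible example: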

# Set up libraries
library(janeaustenr)
library(quanteda)
library(tm)

# Get first 200 words of Mansfield Park
words <- head(mansfieldpark, 200)

# Build a corpus from words
corpus <- quanteda::corpus(words)

# Eliminate some words from counting process
STOPWORDS <- c("the", "and", "a", "an")

# Create a document text matrix and do topic modeling
dtm <- corpus %>% 
    quanteda::dfm(remove_punct = TRUE,
                  remove = STOPWORDS) %>%
    quanteda::dfm_wordstem(.) %>% # Word stemming takes place here
    quanteda::convert("topicmodels")

# Word stems are now stored in dtm$dimnames$Terms

# View a sample of stemmed terms
tail(dtm$dimnames$Terms, 20)

# View the structure of dtm$dimnames$Terms (It's just a character vector)
str(dtm$dimnames$Terms)

# Apply tm::stemCompletion to Terms
unstemmed_terms <-
    tm::stemCompletion(dtm$dimnames$Terms, 
                       dictionary = words, # or corpus
                       type = "shortest")

# Result is composed entirely of NAs, with the values stored as names!
str(unstemmed_terms)

tail(unstemmed_terms, 20)

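I'm looking for a way to get the results returned by stemCompletion() as the values of a character vector, rather than in the names attribute of a character vector. Any insight into this issue would be much appreciated.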

1 answer:

Answer 0: (score: 4)

The problem is that your dictionary argument to tm::stemCompletion() is not a character vector of words (or a tm corpus object), but rather a set of lines from the Austen novel.

tail(words)
# [1] "most liberal-minded sister and aunt in the world."
# [2] ""
# [3] "When the subject was brought forward again, her views were more fully"
# [4] "explained; and, in reply to Lady Bertram's calm inquiry of \"Where shall"
# [5] "the child come to first, sister, to you or to us?\" Sir Thomas heard with"
# [6] "some surprise that it would be totally out of Mrs. Norris's power to"

However, these lines can easily be tokenised with quanteda's tokens() and then converted to a character vector.
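A minimal sketch of the tokenise-then-complete idea, under some assumptions: it is not the original answer's code, the names toks, dict_words, stems, completed and completed_terms are made up for illustration, and the stems are rebuilt with quanteda::char_wordstem() rather than taken from the dtm in the question. It assumes the janeaustenr, quanteda and tm packages are installed.

```r
# Sketch only: assumes janeaustenr, quanteda and tm are installed.
library(janeaustenr)
library(quanteda)
library(tm)

# Same input as in the question: the first 200 lines of the novel
words <- head(mansfieldpark, 200)

# Tokenise the lines, then flatten the tokens object into one vector of words
toks <- quanteda::tokens(words, remove_punct = TRUE)
dict_words <- unique(tolower(unlist(as.list(toks), use.names = FALSE)))

# Illustrative stems (hypothetical stand-in for dtm$dimnames$Terms).
# Keep purely alphabetic stems, because stemCompletion() uses each stem
# as a regular expression prefix.
stems <- unique(quanteda::char_wordstem(dict_words))
stems <- stems[grepl("^[a-z]+$", stems)]

# With a dictionary of single words, completions come back as the values
# and the stems as the names
completed <- tm::stemCompletion(stems, dictionary = dict_words,
                                type = "shortest")

# Drop the names to get a plain character vector aligned with stems
completed_terms <- unname(completed)

head(completed, 10)
```

Some entries can still be NA when no dictionary word begins with a given stem, so it may be worth falling back to the stem itself in those cases.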