Question

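I used the tf-idf method to select the most important words in my texts. The problem is that this method gives me many words that are highly correlated with each other and represent the same context, so they add no new information. I would therefore like to maximize the number of "important" words that are uncorrelated with each other.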

I came up with the following solution:

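library(readr)     # read_csv()
library(dplyr)     # %>%, count(), top_n(), filter()
library(tidytext)  # unnest_tokens(), bind_tf_idf()
library(widyr)     # pairwise_cor()

text <- read_csv('texto.csv')

tfidf <- text %>%
  unnest_tokens(word, `Texto do Comentário`) %>%
  count(word, document) %>%
  bind_tf_idf(word, document, n) %>%
  top_n(10, tf_idf)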

# Now, I use the words generated by tf-idf to find out how these words correlate with the others on the corpus.

correlations <- text %>%
  unnest_tokens(word, `Texto do Comentário`) %>%
  pairwise_cor(word, document) %>%
  filter(item2 %in% tfidf$word)

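So this is as far as I could get. Now I would like to cluster the highly correlated words (correlation > .7) and then collapse each cluster into the single most relevant word among them. I am not sure what the best approach is (PCA? Factor analysis?), and I could not find any help on this task on the Internet.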
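
For concreteness, the kind of collapse described above could be sketched as follows, using base R only. This is a minimal illustration under assumptions, not a settled answer: the toy data frames `tfidf_toy` and `cor_toy` and the helper `collapse_correlated()` are hypothetical names with made-up values, each word is assumed to have a single tf-idf score (for example its maximum across documents), "most relevant" is assumed to mean "highest tf-idf", and words are grouped by single-linkage clustering, so two words end up together whenever a chain of correlations at or above the threshold connects them.

```r
# Toy stand-ins for the objects in the question (values are made up).
tfidf_toy <- data.frame(
  word   = c("price", "cost", "expensive", "delivery", "shipping", "quality"),
  tf_idf = c(0.90, 0.70, 0.60, 0.80, 0.85, 0.50),
  stringsAsFactors = FALSE
)

# Long format like the output of pairwise_cor(): item1, item2, correlation.
cor_toy <- data.frame(
  item1 = c("price", "price", "cost", "delivery", "price", "quality"),
  item2 = c("cost", "expensive", "expensive", "shipping", "delivery", "shipping"),
  correlation = c(0.85, 0.75, 0.72, 0.90, 0.10, 0.20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: group correlated words and keep one word per group.
collapse_correlated <- function(scores, pairs, threshold = 0.7) {
  words <- scores$word

  # Square similarity matrix; pairs that are not listed count as uncorrelated.
  sim <- matrix(0, length(words), length(words), dimnames = list(words, words))
  pairs <- pairs[pairs$item1 %in% words & pairs$item2 %in% words, ]
  sim[cbind(pairs$item1, pairs$item2)] <- pairs$correlation
  sim[cbind(pairs$item2, pairs$item1)] <- pairs$correlation
  diag(sim) <- 1

  # Single linkage on distance 1 - correlation, cut at 1 - threshold:
  # words linked by a chain of strong correlations share a cluster.
  tree <- hclust(as.dist(1 - sim), method = "single")
  scores$cluster <- unname(cutree(tree, h = 1 - threshold))

  # Within each cluster, keep the word with the highest tf-idf.
  ordered <- scores[order(scores$cluster, -scores$tf_idf), ]
  reps <- ordered[!duplicated(ordered$cluster), ]
  rownames(reps) <- NULL
  reps
}

result <- collapse_correlated(tfidf_toy, cor_toy)
print(result)
```

With the real data, `scores` would be built from `tfidf` (one row per word) and `pairs` from the output of `pairwise_cor()`. Single linkage is only one possible choice; complete linkage (`method = "complete"`) would instead require every pair inside a cluster to meet the threshold.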

How can I collapse highly correlated words?

0 answers: