Question

I'm interested in creating a network graph similar to the one shown on this person's website (the first one on the page): http://minimaxir.com/2016/12/interactive-network/

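I'd like the nodes of this graph to be the words in a .txt document (after removing stopwords and other preprocessing). I'd also like the edges to represent how strongly each word is associated with other words in the document (e.g. the word "word" often appears next to the word "up"), keeping only the stronger associations. I'm thinking of "size of node" = "frequency of the word" in the document overall, and "distance between nodes" = "strength/weakness of the relationship between the words".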
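To make the idea concrete, here is a toy sketch in base R of what I mean by node sizes and edge weights. The token vector is made-up data, and counting only directly adjacent words with a cutoff of 2 is just an assumption for illustration, not a settled definition of "correlation".

```r
# Toy tokens standing in for the preprocessed document (made-up data)
toks <- c("word", "up", "word", "up", "graph", "word", "up", "node")

# Node size: how often each word occurs overall
freq <- table(toks)

# Edge weight: how often two words appear directly next to each other
pairs <- cbind(head(toks, -1), tail(toks, -1))
# Sort within each pair so "word up" and "up word" count as the same edge
pairs <- t(apply(pairs, 1, sort))
edges <- as.data.frame(table(a = pairs[, 1], b = pairs[, 2]),
                       stringsAsFactors = FALSE)
edges <- edges[edges$Freq > 0, ]

# Keep only the stronger associations (the cutoff of 2 is arbitrary)
strong <- edges[edges$Freq >= 2, ]
print(strong)
```

In this toy data only the "up"/"word" pair passes the cutoff; in a real document the cutoff would need tuning.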

I'm currently using a combination of R, quanteda, and ggplot2, along with a few other dependencies.

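If anyone has any advice on how to generate the word associations in R (preferably with quanteda) and then plot them as a graph, I would be eternally grateful!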

Of course, if there is any way I can improve this question, please let me know. Here is my attempt so far:

library(quanteda)
library(readtext)
library(ggplot2)
library(stringi)

## Load the .txt doc (readtext() returns a data frame with a `text` column)
document <- readtext("file1.txt")

## Make everything lowercase... store in a separate variable
documentlower <- char_tolower(document$text)

## Tokenize the lower-case document
documenttokens <- as.character(tokens(documentlower, remove_punct = TRUE))
(total_length <- length(documenttokens))

## Create the document-feature matrix - here we can also remove stopwords and stem
docudfm <- dfm(documentlower, remove_punct = TRUE, remove = stopwords("english"), stem = TRUE)

## Inspect the top 10 Words by Count
textstat_frequency(docudfm, n = 10)

## Create a sorted list of tokens by frequency count
sorted_document <- topfeatures(docudfm, n = nfeat(docudfm))

## Normalize the data points to find their percentage of occurrence in the documents
sorted_document <- sorted_document / sum(sorted_document) * 100

## Also normalize the data points in the DFM
docudfm_pct <- dfm_weight(docudfm, scheme = "prop") * 100
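
One direction that might work from here, as a rough sketch only: build a feature co-occurrence matrix with quanteda's fcm() and draw it with textplot_network(). This assumes quanteda version 3 or later, where textplot_network() lives in the separate quanteda.textplots package. The sample text, the window of 2, the top-30 cutoff, and min_freq = 0.5 are placeholder choices rather than recommendations.

```r
library(quanteda)
library(quanteda.textplots)  # assumption: quanteda >= 3

# Stand-in text; in practice this would be the text read from file1.txt
txt <- paste("the network graph links each node by an edge",
             "and the graph sizes each node by word frequency",
             "while the edge shows which word sits near which word")

# Tokenize, lowercase, and drop stopwords
toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))

# Count co-occurrences within 2 tokens on either side of each word
cooc <- fcm(toks, context = "window", window = 2, tri = FALSE)

# Keep only the most frequent features so the plot stays readable
top <- names(topfeatures(cooc, 30))
cooc_top <- fcm_select(cooc, pattern = top, selection = "keep")

# Node size taken from overall word frequency, in the same order as the fcm
freq <- colSums(dfm(toks))[colnames(cooc_top)]

# min_freq drops weaker co-occurrences, leaving only the stronger edges
textplot_network(cooc_top, min_freq = 0.5,
                 vertex_size = freq / max(freq) * 3)
```

Note that this plots raw co-occurrence counts rather than a statistical correlation, and the layout distance between nodes is chosen by the plotting algorithm, so it only loosely reflects association strength.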

Generating a network graph from the correlations between words in a document

0 answers: