After analyzing a large number of scientific papers, I ended up with a term-term adjacency matrix of roughly 32,000 terms, like the example below:
Terms           abart  abb  abbieg  abbiegelicht  abblend  abblendlicht  abbrech  abbrems  abdeck  abend
abart               4    1       0             0        0             1        0        0       2      0
abb                 1    6       0             4        2             4        0        0       3      2
abbieg              0    0       4             3        3             2        0        0       3      3
abbiegelicht        0    4       3            12        7             8        1        1       5      5
abblend             0    2       3             7       12             6        1        1       7     10
abblendlicht        1    4       2             8        6            20        3        3      10     11
abbrech             0    0       0             1        1             3        4        2       2      1
abbrems             0    0       0             1        1             3        2        5       3      2
abdeck              2    3       3             5        7            10        2        3      17      9
abend               0    2       3             5       10            11        1        2       9      1
I was also given a list of words that usually occur together, i.e. whose adjacency frequencies are very high:
wordsToStudy <- c("dog","cat","cow","horse","donkey","goat", "elephant")
My task is to analyse how strongly these words are related to each other and, in addition, to find which other words (not in the original list) also have a high "adjacency index" with the given terms.
For example, I would expect the word "mouse" to show up among those other terms: although "cat" probably has the strongest association with mouse, "dog" (since both are pets) and "elephant" (because of its supposed panic about mice) are also likely to appear next to it frequently.
I think the biggest hurdle is finding words that not only have a strong association with the given words, but also have some association with each other. If "mouse" is accepted as a possible new term, and "bull" is too (because of its association with "cow"), it may turn out that "mouse" and "bull" have no relationship at all. That would not be ideal, but it is certainly acceptable.
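Roughly what I have in mind, as a minimal sketch with a tiny made-up matrix (all numbers are invented purely for illustration; I have not run this on the real 32,000-term matrix): score every term by its total adjacency to the given words, keep the strongest candidates, and then look at the mutual adjacency of those candidates.

# toy stand-in for the full 32,000 x 32,000 term-term adjacency matrix
terms <- c("dog", "cat", "mouse", "cow", "bull", "elephant")
M <- matrix(c(0, 8, 3, 1, 0, 2,
              8, 0, 9, 0, 0, 1,
              3, 9, 0, 0, 0, 4,
              1, 0, 0, 0, 7, 1,
              0, 0, 0, 7, 0, 0,
              2, 1, 4, 1, 0, 0),
            nrow = 6, dimnames = list(terms, terms))
wordsToStudy <- c("dog", "cat", "cow", "elephant")
# score every term by its total adjacency to the given words
score <- rowSums(M[, wordsToStudy, drop = FALSE])
# keep the strongest candidates that are not already in the given list
candidates <- head(setdiff(names(sort(score, decreasing = TRUE)), wordsToStudy), 2)
# mutual adjacency of the candidates (here "mouse" and "bull", which happens to be 0)
M[candidates, candidates]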
The perfect approach would let me choose manually how many terms end up in the reduced term-term adjacency matrix (say 50, 100, 200 terms, ...). I think the final output should be an adjacency graph similar to the figure below.
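If the reduction were simply done by overall connectivity, I imagine something like this (again only a sketch, continuing with the toy matrix M from above; on the real data N would be 50, 100, 200, ...):

require(igraph)
# keep only the N best-connected terms and build the reduced graph from them
N <- 4
keep <- names(sort(rowSums(M), decreasing = TRUE))[seq_len(N)]
M.red <- M[keep, keep]
g.red <- simplify(graph.adjacency(M.red, weighted = TRUE, mode = "undirected"))
plot(g.red, layout = layout.fruchterman.reingold(g.red))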
I only have a very basic knowledge of statistics, so maybe what I am looking for has a specific name in that field. Perhaps you know a package that can do this, because I have no idea how to do it by hand.
My current code is:
# Packages ----------------------------------------------------------------
require(RWeka)
require(tau)
require(tm)
require(tm.plugin.webmining)
require(wordcloud)
require(igraph)
#specify the directory containing the text files.
folderdir <- "C:/texts"
#load the corpus.
corpus <- Corpus(DirSource(folderdir, encoding = "UTF-8"), readerControl=list(reader=readPlain,language="de"))
#cleanse the corpus.
ds0.1g <- tm_map(corpus, content_transformer(tolower))
ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), stopwords("german"))
ds2.1g <- tm_map(ds1.1g, stripWhitespace)
ds3.1g <- tm_map(ds2.1g, removePunctuation)
ds4.1g <- tm_map(ds3.1g, stemDocument)
ds4.1g <- tm_map(ds4.1g, removeNumbers)
#create matrices.
tdm.1g <- TermDocumentMatrix(ds4.1g)
dtm.1g <- DocumentTermMatrix(ds4.1g)
#reduce the sparsity.
tdm89.1g <- removeSparseTerms(tdm.1g, 0.89)
tdm9.1g <- removeSparseTerms(tdm.1g, 0.9)
tdm91.1g <- removeSparseTerms(tdm.1g, 0.91)
tdm92.1g <- removeSparseTerms(tdm.1g, 0.92)
tdm2.1g <- tdm92.1g
# Create a Boolean matrix (counts # docs w/ terms, not raw term counts)
tdm3.1g <- as.matrix(tdm2.1g)   # convert the term-document matrix to a plain matrix
tdm3.1g[tdm3.1g >= 1] <- 1
# Transform into a term-term adjacency matrix
termMatrix.1gram <- tdm3.1g %*% t(tdm3.1g)
# build a graph from the above matrix
g <- graph.adjacency(termMatrix.1gram, weighted=T, mode = "undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
# set seed to make the layout reproducible
set.seed(3952)
layout1 <- layout.fruchterman.reingold(g)
# set vertex label size based on degree
V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree) + .2
V(g)$label.color <- rgb(0, 0, .2, .8)
V(g)$frame.color <- NA
# set edge width and transparency based on edge weight
egam <- (log(E(g)$weight) + .4) / max(log(E(g)$weight) + .4)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
# plot the graph in layout1
plot(g, layout=layout1)
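What I imagine as a next step, restricting the graph to the study words and their direct neighbours, might look roughly like this (untested sketch; it assumes the study words actually survive removeSparseTerms, which with my real data they may not):

# keep only the study words that are present in the graph
focus <- intersect(wordsToStudy, V(g)$name)
# collect their direct neighbours
nbrs <- unique(unlist(lapply(focus, function(w) neighbors(g, w)$name)))
# plot the subgraph induced by the study words and their neighbours
g.sub <- induced.subgraph(g, union(focus, nbrs))
plot(g.sub, layout = layout.fruchterman.reingold(g.sub))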