Using a given word list

Time: 2016-01-18 22:56:21

Tags: r text-mining adjacency-matrix

After analyzing a large number of scientific papers, I have ended up with a term-term adjacency matrix of about 32,000 terms, like the following example:

              Terms
Terms          abart abb abbieg abbiegelicht abblend abblendlicht abbrech abbrems abdeck abend
  abart            4   1      0            0       0            1       0       0      2     0
  abb              1   6      0            4       2            4       0       0      3     2
  abbieg           0   0      4            3       3            2       0       0      3     3
  abbiegelicht     0   4      3           12       7            8       1       1      5     5
  abblend          0   2      3            7      12            6       1       1      7    10
  abblendlicht     1   4      2            8       6           20       3       3     10    11
  abbrech          0   0      0            1       1            3       4       2      2     1
  abbrems          0   0      0            1       1            3       2       5      3     2
  abdeck           2   3      3            5       7           10       2       3     17     9
  abend            0   2      3            5      10           11       1       2      9    1

I was given a list of words that usually cluster together; their frequency of appearing next to each other is also very high:

wordsToStudy <- c("dog","cat","cow","horse","donkey","goat", "elephant")

My task is to analyze the degree of relationship between these words, and also to discover which other words (not given initially) have a high "adjacency index" with the given terms.

For example, I assume the word "mouse" will appear among the additional terms, because although "cat" probably has the strongest association with "mouse", both "dog" (since both are pets) and "elephant" (because of its panic over mice) are likely to appear next to it frequently.
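What I have in mind, roughly, is to score every term by its total adjacency with the given list and keep the best candidates. This is only a sketch; it assumes the term-term matrix termMatrix.1gram built in my code further below, and that the given words survive preprocessing and appear among its row names:

# keep only the given words that actually occur in the matrix
given <- intersect(wordsToStudy, rownames(termMatrix.1gram))
# total adjacency of every term with the given word list
assocScore <- colSums(termMatrix.1gram[given, , drop = FALSE])
# drop the given words themselves and keep, say, the 50 best candidates
assocScore <- assocScore[!names(assocScore) %in% given]
topCandidates <- head(names(sort(assocScore, decreasing = TRUE)), 50)

I do not know whether summing the adjacency counts like this is a reasonable "adjacency index", or whether there is an established measure for it.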

I think the biggest obstacle is finding words that not only have a strong association with the given words, but also have some association with each other. If "mouse" is accepted as a possible new term, and "bull" as well (because of its association with "cow"), it may be that "mouse" and "bull" have no relationship at all. That is not ideal, but it is of course acceptable.
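For this mutual-relationship constraint, the only thing I can think of is to filter the candidates afterwards, keeping those that are adjacent to at least a few of the other candidates. A sketch with an arbitrary threshold k, reusing topCandidates from above:

k <- 3
sub <- termMatrix.1gram[topCandidates, topCandidates]
diag(sub) <- 0                        # ignore self-adjacency
topCandidates <- topCandidates[rowSums(sub > 0) >= k]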

The perfect approach would let me manually determine the number of terms contained in the reduced term-term adjacency matrix (say 50, 100, 200 terms, ...). I think the final output should be an adjacency graph like the one in the following image:

[Image: term adjacency graph, extracted from: http://www.rdatamining.com/examples/social-network-analysis]
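If the candidate-selection idea above makes sense, reducing the matrix to a hand-picked size would then just be a matter of subsetting (a sketch, reusing given and topCandidates from the snippets above):

nTerms <- 100   # 50, 100, 200, ...
keep <- head(c(given, topCandidates), nTerms)
reducedMatrix <- termMatrix.1gram[keep, keep]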

I have only a very basic knowledge of statistics, and perhaps what I am looking for has a specific name in that field. Maybe you know of a package that can do this, because I am not sure how to do it manually.

My current code is:

# Packages ----------------------------------------------------------------
require(RWeka)
require(tau)
require(tm)
require(tm.plugin.webmining)
require(wordcloud)
require(igraph)

# specify the directory containing the text files
folderdir <- "C:/texts"
# load the corpus
corpus <- Corpus(DirSource(folderdir, encoding = "UTF-8"), readerControl=list(reader=readPlain,language="de"))
# clean the corpus
ds0.1g <- tm_map(corpus, content_transformer(tolower))
ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), stopwords("german"))
ds2.1g <- tm_map(ds1.1g, stripWhitespace)
ds3.1g <- tm_map(ds2.1g, removePunctuation)
ds4.1g <- tm_map(ds3.1g, stemDocument)
ds4.1g <- tm_map(ds4.1g, removeNumbers)
# create the matrices
tdm.1g <- TermDocumentMatrix(ds4.1g)
dtm.1g <- DocumentTermMatrix(ds4.1g)
# reduce the sparsity
tdm89.1g <- removeSparseTerms(tdm.1g, 0.89)
tdm9.1g  <- removeSparseTerms(tdm.1g, 0.9)
tdm91.1g <- removeSparseTerms(tdm.1g, 0.91)
tdm92.1g <- removeSparseTerms(tdm.1g, 0.92)

tdm2.1g <- tdm92.1g

# Create a Boolean matrix (counts # of docs containing each term, not raw term frequencies)
tdm3.1g <- as.matrix(tdm2.1g)   # as.matrix() returns the actual matrix; inspect() only prints it
tdm3.1g[tdm3.1g >= 1] <- 1

# Transform into a term-term adjacency matrix
termMatrix.1gram <- tdm3.1g %*% t(tdm3.1g)

# build a graph from the above matrix
g <- graph.adjacency(termMatrix.1gram, weighted = TRUE, mode = "undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

# set seed to make the layout reproducible
set.seed(3952)
layout1 <- layout.fruchterman.reingold(g)

# set label size of vertices based on their degrees and width and transparency of edges based on their weights
V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree)+ .2
V(g)$label.color <- rgb(0, 0, .2, .8)
V(g)$frame.color <- NA
egam <- (log(E(g)$weight)+.4) / max(log(E(g)$weight)+.4)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
# plot the graph in layout1
plot(g, layout=layout1)
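
If the candidate-selection idea above is workable, I suppose the reduced matrix could be plotted the same way. A sketch, assuming reducedMatrix from the subsetting step above:

gRed <- graph.adjacency(reducedMatrix, weighted = TRUE, mode = "undirected")
gRed <- simplify(gRed)
V(gRed)$label <- V(gRed)$name
set.seed(3952)
plot(gRed, layout = layout.fruchterman.reingold(gRed))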
