Using a given word list

Time: 2016-01-18 22:56:21

Tags: r text-mining adjacency-matrix

After analyzing a large number of scientific papers, I have ended up with a term-term adjacency matrix of about 32,000 terms, like the following example:

              Terms
Terms          abart abb abbieg abbiegelicht abblend abblendlicht abbrech abbrems abdeck abend
  abart            4   1      0            0       0            1       0       0      2     0
  abb              1   6      0            4       2            4       0       0      3     2
  abbieg           0   0      4            3       3            2       0       0      3     3
  abbiegelicht     0   4      3           12       7            8       1       1      5     5
  abblend          0   2      3            7      12            6       1       1      7    10
  abblendlicht     1   4      2            8       6           20       3       3     10    11
  abbrech          0   0      0            1       1            3       4       2      2     1
  abbrems          0   0      0            1       1            3       2       5      3     2
  abdeck           2   3      3            5       7           10       2       3     17     9
  abend            0   2      3            5      10           11       1       2      9    1

I was given a list of words that usually cluster together; their frequency of appearing next to each other is also very high:

wordsToStudy <- c("dog","cat","cow","horse","donkey","goat", "elephant")

My task is to analyze the degree of relationship between these words, and also to discover which other words (not given initially) have a high "adjacency index" with the given terms.

For example, I assume the word "mouse" will appear among the additional terms, because although "cat" probably has the strongest association with "mouse", both "dog" (since both are pets) and "elephant" (because of its panic over mice) are likely to appear next to it frequently.
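What I have in mind, roughly, is to score every term by its total adjacency with the given list and keep the best candidates. This is only a sketch; it assumes the term-term matrix termMatrix.1gram built in my code further below, and that the given words survive preprocessing and appear among its row names:

# keep only the given words that actually occur in the matrix
given <- intersect(wordsToStudy, rownames(termMatrix.1gram))
# total adjacency of every term with the given word list
assocScore <- colSums(termMatrix.1gram[given, , drop = FALSE])
# drop the given words themselves and keep, say, the 50 best candidates
assocScore <- assocScore[!names(assocScore) %in% given]
topCandidates <- head(names(sort(assocScore, decreasing = TRUE)), 50)

I do not know whether summing the adjacency counts like this is a reasonable "adjacency index", or whether there is an established measure for it.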

I think the biggest obstacle is finding words that not only have a strong association with the given words, but also have some association with each other. If "mouse" is accepted as a possible new term, and "bull" as well (because of its association with "cow"), it may be that "mouse" and "bull" have no relationship at all. That is not ideal, but it is of course acceptable.
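For this mutual-relationship constraint, the only thing I can think of is to filter the candidates afterwards, keeping those that are adjacent to at least a few of the other candidates. A sketch with an arbitrary threshold k, reusing topCandidates from above:

k <- 3
sub <- termMatrix.1gram[topCandidates, topCandidates]
diag(sub) <- 0                        # ignore self-adjacency
topCandidates <- topCandidates[rowSums(sub > 0) >= k]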

The perfect approach would let me manually determine the number of terms contained in the reduced term-term adjacency matrix (say 50, 100, 200 terms, ...). I think the final output should be an adjacency graph like the one in the following image:

[Image: term adjacency graph, extracted from: http://www.rdatamining.com/examples/social-network-analysis]
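If the candidate-selection idea above makes sense, reducing the matrix to a hand-picked size would then just be a matter of subsetting (a sketch, reusing given and topCandidates from the snippets above):

nTerms <- 100   # 50, 100, 200, ...
keep <- head(c(given, topCandidates), nTerms)
reducedMatrix <- termMatrix.1gram[keep, keep]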

I have only a very basic knowledge of statistics, and perhaps what I am looking for has a specific name in that field. Maybe you know of a package that can do this, because I am not sure how to do it manually.

My current code is:

# Packages ----------------------------------------------------------------
require(RWeka)
require(tau)
require(tm)
require(tm.plugin.webmining)
require(wordcloud)
require(igraph)

# specify the directory containing the text files
folderdir <- "C:/texts"
# load the corpus
corpus <- Corpus(DirSource(folderdir, encoding = "UTF-8"), readerControl=list(reader=readPlain,language="de"))
# clean the corpus
ds0.1g <- tm_map(corpus, content_transformer(tolower))
ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), stopwords("german"))
ds2.1g <- tm_map(ds1.1g, stripWhitespace)
ds3.1g <- tm_map(ds2.1g, removePunctuation)
ds4.1g <- tm_map(ds3.1g, stemDocument)
ds4.1g <- tm_map(ds4.1g, removeNumbers)
# create the matrices
tdm.1g <- TermDocumentMatrix(ds4.1g)
dtm.1g <- DocumentTermMatrix(ds4.1g)
# reduce the sparsity
tdm89.1g <- removeSparseTerms(tdm.1g, 0.89)
tdm9.1g  <- removeSparseTerms(tdm.1g, 0.9)
tdm91.1g <- removeSparseTerms(tdm.1g, 0.91)
tdm92.1g <- removeSparseTerms(tdm.1g, 0.92)

tdm2.1g <- tdm92.1g

# Create a Boolean matrix (counts # of docs containing each term, not raw term frequencies)
tdm3.1g <- as.matrix(tdm2.1g)   # as.matrix() returns the actual matrix; inspect() only prints it
tdm3.1g[tdm3.1g >= 1] <- 1

# Transform into a term-term adjacency matrix
termMatrix.1gram <- tdm3.1g %*% t(tdm3.1g)

# build a graph from the above matrix
g <- graph.adjacency(termMatrix.1gram, weighted = TRUE, mode = "undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

# set seed to make the layout reproducible
set.seed(3952)
layout1 <- layout.fruchterman.reingold(g)

# set label size of vertices based on their degrees and width and transparency of edges based on their weights
V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree)+ .2
V(g)$label.color <- rgb(0, 0, .2, .8)
V(g)$frame.color <- NA
egam <- (log(E(g)$weight)+.4) / max(log(E(g)$weight)+.4)
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
# plot the graph in layout1
plot(g, layout=layout1)
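
If the candidate-selection idea above is workable, I suppose the reduced matrix could be plotted the same way. A sketch, assuming reducedMatrix from the subsetting step above:

gRed <- graph.adjacency(reducedMatrix, weighted = TRUE, mode = "undirected")
gRed <- simplify(gRed)
V(gRed)$label <- V(gRed)$name
set.seed(3952)
plot(gRed, layout = layout.fruchterman.reingold(gRed))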
