Finding a co-occurrence matrix using R

Time: 2016-03-13 17:10:01

Tags: r nlp

I have the text file shown below.

 Other methods of contraception were discussed, in the framework of
 a chart which showed both the _expected_ failure rate (theoretical,
 assumes no mistakes) and the _actual_ failure rate (based on research).
 Top of the chart was something like this:


 Method                  Expected         Actual 
 ------                 Failure Rate    Failure Rate
 Abstinence                 0%              0% 


 And NFP (Natural Family Planning) was on the bottom. The teacher even
 said, "I've had some students tell me that they can't use anything for
 birth control because they're Catholic. Well, if you're not married and
 you're a practicing Catholic, the *top* of the list is your slot, not 
 the *bottom*.  Even if you're not religious, the top of the list is
 safest."

From this text file I need to build a term-term co-occurrence matrix, such as

 Correct format required
        a  b  c
     a  0  2  1
     b  1  0  2
     c  2  1  0 

What I have done so far is build a sentence-by-word matrix like this:

sentence_id  words
    1        a     b     c      d     e
    2        b     c     f      g     h
    3        j     k     a      b     c

This is the same problem as in the question "build word co-occurence edge list in R", but the format produced by that answer is different from the one I need.

  d <- read.table(text='sentence_id text
  1           "a b c d e"
  2           "a b b e"
  3           "b c d"
  4           "a e"', header=TRUE, as.is=TRUE)

  result.vec <- table(unlist(lapply(d$text, function(text) {
    pairs <- combn(unique(scan(text=text, what='', sep=' ')), m=2)
    interaction(pairs[1,], pairs[2,])
  })))

  result <- subset(data.frame(do.call(rbind, strsplit(names(result.vec), '\\.')),
                              freq=as.vector(result.vec)), freq > 0)
  with(result, result[order(X1, X2),])

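This is the code I am currently using, but it does not produce the co-occurrence matrix in the required format. Instead it produces a list of word pairs with their frequencies: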

wrong format
#    X1 X2 freq
# 1   a  b    2
# 5   a  c    1
# 9   a  d    1
# 13  a  e    3
# 6   b  c    2
# 10  b  d    2
# 14  b  e    2
# 11  c  d    2
# 15  c  e    1
# 16  d  e    1
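
For reference, a pair list of this shape can be pivoted into a square term-by-term matrix. The sketch below is only an illustration in base R: the `result` data frame is retyped by hand from the output above, and the names `terms` and `cooc` are introduced here, not taken from the question. It assumes co-occurrence is symmetric, so each count is written to both `[i, j]` and `[j, i]`.

```r
# Pair counts copied by hand from the "wrong format" output above
result <- data.frame(
  X1   = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
  X2   = c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
  freq = c(2, 1, 1, 3, 2, 2, 2, 2, 1, 1),
  stringsAsFactors = FALSE
)

# All distinct terms, used for both the rows and the columns
terms <- sort(unique(c(result$X1, result$X2)))

# Start from an all-zero square matrix, so the diagonal stays 0
cooc <- matrix(0, nrow = length(terms), ncol = length(terms),
               dimnames = list(terms, terms))

# A two-column character matrix indexes cells by (row name, column name)
cooc[cbind(result$X1, result$X2)] <- result$freq
cooc[cbind(result$X2, result$X1)] <- result$freq  # mirror to the lower triangle

print(cooc)
```

Because every pair is mirrored, the resulting matrix is symmetric, unlike the schematic 3 x 3 example in the question.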

1 answer:

Answer 0 (score: 1)

I obtained the term-term co-occurrence matrix by going through a term-document matrix.

library(tm)    # corpus handling and TermDocumentMatrix
library(qdap)  # depends on qdapDictionaries, which is used below

# Folder of text files to analyse. "Doc50" is the answerer's own folder,
# apparently placed inside the installed tm package; point this at your own data.
txt <- system.file("Doc50", "", package = "tm")

(ovid <- VCorpus(DirSource(txt),
                 readerControl = list(language = "en")))

# Clean the text: drop stop words, punctuation, extra whitespace and digits
ovid <- tm_map(ovid, removeWords, stopwords("english"))
ovid <- tm_map(ovid, removePunctuation)
ovid <- tm_map(ovid, stripWhitespace)
ovid <- tm_map(ovid, removeNumbers)

# Rows are terms, columns are documents
termDocMatrix <- TermDocumentMatrix(ovid)
termDocMatrix <- as.matrix(termDocMatrix)

# Keep only terms that appear in the GradyAugmented English word list
colnamesmdsm <- rownames(termDocMatrix)
intersected <- intersect(colnamesmdsm, qdapDictionaries::GradyAugmented)
termDocMatrix <- termDocMatrix[intersected, ]

# Convert counts to presence/absence (1 if the term occurs in the document)
termDocMatrix[termDocMatrix >= 1] <- 1

# Transform into a term-term adjacency matrix
termMatrix <- termDocMatrix %*% t(termDocMatrix)
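
To see what the final multiplication does, here is a minimal sketch on toy data in base R only, with no tm or qdap required. The four "documents" are the sentences from the question, and the names `docs`, `tokens`, `terms`, `tdm` and `cooc` are introduced here for illustration. Entry `[i, j]` of the product counts the documents that contain both term i and term j. The diagonal then holds each term's document frequency, so the sketch zeroes it, on the assumption that the zero diagonal shown in the question is wanted.

```r
# Toy corpus: the four sentences from the question
docs <- c("a b c d e", "a b b e", "b c d", "a e")

# Distinct words per document, and the overall vocabulary
tokens <- lapply(strsplit(docs, " ", fixed = TRUE), unique)
terms  <- sort(unique(unlist(tokens)))

# Binary term-document matrix: 1 if the term occurs in the document, else 0
tdm <- sapply(tokens, function(tok) as.integer(terms %in% tok))
dimnames(tdm) <- list(terms, paste0("doc", seq_along(docs)))

# Term-term co-occurrence: number of documents shared by each pair of terms
cooc <- tdm %*% t(tdm)

# The diagonal is each term's document frequency; set it to 0 to match
# the layout requested in the question
diag(cooc) <- 0

print(cooc)
```

The counts agree with the pair list in the question, for example `cooc["a", "e"]` is 3 and `cooc["b", "c"]` is 2. For a large vocabulary, a sparse matrix representation may be preferable to `as.matrix`, since the dense term-term matrix grows with the square of the number of terms.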