I have a text file with the content shown below.
Other methods of contraception were discussed, in the framework of
a chart which showed both the _expected_ failure rate (theoretical,
assumes no mistakes) and the _actual_ failure rate (based on research).
Top of the chart was something like this:
Method Expected Actual
------ Failure Rate Failure Rate
Abstinence 0% 0%
And NFP (Natural Family Planning) was on the bottom. The teacher even
said, "I've had some students tell me that they can't use anything for
birth control because they're Catholic. Well, if you're not married and
you're a practicing Catholic, the *top* of the list is your slot, not
the *bottom*. Even if you're not religious, the top of the list is
safest."
From this text file I need to build a term-term co-occurrence matrix, like this:
Required format:
  a b c
a 0 2 1
b 1 0 2
c 2 1 0
What I have done so far is build a table of sentences and the words they contain, like this:
sentence_id words
1 a b c d e
2 b c f g h
3 j k a b c
This is the same problem as in the question "build word co-occurence edge list in R", but the output format in that answer is different from the one I need.
# Example data and code from that answer
d <- read.table(text='sentence_id text
1 "a b c d e"
2 "a b b e"
3 "b c d"
4 "a e"', header=TRUE, as.is=TRUE)
# For each sentence, form every pair of distinct words, then tabulate the pairs
result.vec <- table(unlist(lapply(d$text, function(text) {
  pairs <- combn(unique(scan(text=text, what='', sep=' ')), m=2)
  interaction(pairs[1,], pairs[2,])
})))
# Split the "word1.word2" labels back into two columns and keep observed pairs
result <- subset(data.frame(do.call(rbind, strsplit(names(result.vec),
  '\\.')), freq=as.vector(result.vec)), freq > 0)
with(result, result[order(X1, X2),])
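This is the code I am currently using, but it does not produce the co-occurrence matrix in the required format. It produces the following edge list instead.
Wrong format: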
# X1 X2 freq
# 1 a b 2
# 5 a c 1
# 9 a d 1
# 13 a e 3
# 6 b c 2
# 10 b d 2
# 14 b e 2
# 11 c d 2
# 15 c e 1
# 16 d e 1
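An edge list like the one above already holds all the counts, so one way to get a square matrix is to fill an empty term-by-term matrix from it. The following is a minimal sketch in base R, not part of the original code: the names `edges`, `vocab`, and `cooc` are made up for illustration, and the data frame is typed in from the output shown above. It assumes co-occurrence is symmetric, so each count is written to both (X1, X2) and (X2, X1). If `X1` and `X2` are factors in your own `result`, convert them with `as.character()` first.

```r
# Edge list copied from the "wrong format" output (hypothetical name: edges)
edges <- data.frame(
  X1   = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
  X2   = c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
  freq = c(2, 1, 1, 3, 2, 2, 2, 2, 1, 1),
  stringsAsFactors = FALSE
)

# All distinct terms, used for both row and column names
vocab <- sort(unique(c(edges$X1, edges$X2)))

# Start from an all-zero square matrix, so the diagonal stays 0
cooc <- matrix(0, nrow = length(vocab), ncol = length(vocab),
               dimnames = list(vocab, vocab))

# A two-column character matrix indexes cells by (row name, column name)
cooc[cbind(edges$X1, edges$X2)] <- edges$freq
cooc[cbind(edges$X2, edges$X1)] <- edges$freq  # mirror into the lower triangle

cooc
```

Pairs that never occur together simply remain 0, so the edge list does not have to be complete.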
Answer 0 (score: 1)
I obtained the term-term co-occurrence matrix by going through a term-document matrix.
library(tm)    # corpus handling and TermDocumentMatrix
library(qdap)  # installs alongside qdapDictionaries, which provides GradyAugmented
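# "Doc50" is the folder holding the text files. system.file() returns "" when
# that folder does not exist inside the tm package, so point txt at your own
# directory if needed.
txt <- system.file("Doc50", "", package = "tm")
(ovid <- VCorpus(DirSource(txt),
                 readerControl = list(language = "en")))
# Clean the corpus
ovid <- tm_map(ovid, removeWords, stopwords("english"))
ovid <- tm_map(ovid, removePunctuation)
ovid <- tm_map(ovid, stripWhitespace)
ovid <- tm_map(ovid, removeNumbers)
# Terms in rows, documents in columns
termDocMatrix <- TermDocumentMatrix(ovid)
termDocMatrix <- as.matrix(termDocMatrix)
# Keep only terms that appear in the GradyAugmented English word list
colnamesmdsm <- rownames(termDocMatrix)
intersected <- intersect(colnamesmdsm, qdapDictionaries::GradyAugmented)
termDocMatrix <- termDocMatrix[intersected,]
# Reduce counts to presence/absence (1 if the term occurs in the document)
termDocMatrix[termDocMatrix>=1] <- 1
# transform into a term-term adjacency matrix: cell (i, j) is the number of
# documents containing both terms; the diagonal holds each term's document frequency
termMatrix <- termDocMatrix %*% t(termDocMatrix)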
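To see what the matrix product does without needing a corpus on disk, the same idea can be tried on the three toy sentences from the question. This is only a sketch in base R under a few assumptions: each sentence is treated as one "document", words are split on single spaces, and the names `sentences`, `tokens`, `vocab`, and `tdm` are invented for the example. It also zeroes the diagonal to match the format requested in the question, which the code above does not do.

```r
# Toy sentences from the question, one "document" each
sentences <- c("a b c d e", "b c f g h", "j k a b c")

# Distinct words per sentence
tokens <- lapply(strsplit(sentences, " ", fixed = TRUE), unique)
vocab  <- sort(unique(unlist(tokens)))

# Binary term-by-sentence matrix: 1 if the term occurs in the sentence
tdm <- sapply(tokens, function(tok) as.integer(vocab %in% tok))
rownames(tdm) <- vocab

# Term-term matrix: cell (i, j) counts the sentences containing both terms
termMatrix <- tdm %*% t(tdm)

# The diagonal would otherwise hold each term's sentence frequency
diag(termMatrix) <- 0

termMatrix[c("a", "b", "c"), c("a", "b", "c")]
```

Because the input is reduced to presence/absence first, a word repeated within one sentence is counted once for that sentence.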