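I need to create a similarity matrix from a document-term matrix so that I can perform max capturing on the documents. So far I have only found solutions that produce a distance matrix. I tried the dist method, but it gave me the wrong output. Is there a way to create a similarity matrix in R? I use the tm package in the code below, but I am not restricted to it; if there is any other good package, please let me know. The code so far: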
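# Install the packages (SnowballC supersedes the archived Snowball package
# and is what tm uses for stemming)
install.packages("tm")
install.packages("rJava")
install.packages("SnowballC")
install.packages("RWeka")
install.packages("RWekajars")
install.packages("XML")
install.packages("openNLP")
# openNLPmodels.en may not be available from CRAN and might need another repository
install.packages("openNLPmodels.en")
Sys.setenv(NOAWT=TRUE)
library(XML)
library(rJava)
library(SnowballC)
library(RWeka)
library(tm)
library(openNLP)
library(openNLPmodels.en)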
sample = c(
"cc ee aa",
"dd bb ee",
"bb cc ee dd",
"cc ee dd aa",
"bb ee",
"cc dd aa",
"bb cc aa",
"bb cc",
"cc ee dd"
)
print(sample)
corpus <- Corpus(VectorSource(sample))
inspect(corpus)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))  # wrap base functions so the corpus structure is kept
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument,language="english")
corpus <- tm_map(corpus, stripWhitespace)
# corpus <- tm_map(corpus, tmTagPOS)  # tmTagPOS is not available in current tm releases; POS tags are not needed for term counts
inspect(corpus)
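# Keep two-letter terms such as "aa"; the default minimum word length is 3
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(2, Inf)))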
inspect(dtm)
# need to create similarity matrix here
dist(as.matrix(dtm), method = "manhattan", diag = FALSE, upper = FALSE)  # returns distances, not similarities
The output for the given sample should look like this
The similarity matrix is defined as:
if (i < j)
a[i][j] = sim[i][j]
else
a[i][j] = 0
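
For illustration, here is a minimal sketch of how the definition above could be applied in R. The post does not say what sim[i][j] is, so cosine similarity between the term-count vectors of two documents is used here purely as an assumption; any other pairwise measure could be swapped in. The helper name upper_similarity is hypothetical, and the count matrix is built with base R only as a stand-in for as.matrix(dtm) so the example runs without tm.

```r
# Hypothetical helper: cosine similarity between all pairs of rows of a
# document-term count matrix (rows = documents, columns = terms), keeping
# only the entries with i < j as in the definition above.
# Assumption: sim[i][j] is cosine similarity; the post does not specify it.
upper_similarity <- function(m) {
  m <- as.matrix(m)
  norms <- sqrt(rowSums(m^2))                 # Euclidean length of each document vector
  sim <- tcrossprod(m) / outer(norms, norms)  # pairwise cosine similarity
  sim[lower.tri(sim, diag = TRUE)] <- 0       # a[i][j] = 0 unless i < j
  sim
}

# Stand-in for as.matrix(dtm): term counts for the sample documents.
docs <- c("cc ee aa", "dd bb ee", "bb cc ee dd", "cc ee dd aa", "bb ee",
          "cc dd aa", "bb cc aa", "bb cc", "cc ee dd")
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))
m <- t(vapply(tokens,
              function(tok) as.numeric(table(factor(tok, levels = terms))),
              numeric(length(terms))))
dimnames(m) <- list(seq_along(docs), terms)

a <- upper_similarity(m)
print(round(a, 2))
```

With the tm pipeline from the question, the same helper could be called as upper_similarity(as.matrix(dtm)). A document with no terms left after preprocessing would have zero length and produce NaN entries, so such rows may need to be dropped first.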