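I need to create a similarity matrix, and the code below is what I have so far. However, the result is not what I need. The code returns a matrix with 16 rows, which is the product of the 8 unique terms in the document-term matrix and the 2 unique terms in workTitle.
What I need is a matrix with only 4 rows (one per title), where each row holds the sum of the edit distances between each word in workTitle and each word in that title.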
require(tm)
workTitle <- c("biomechanical engineer")
titles <- c("train machinist", "operations supervisor", "pharmacy tech", "mechanical engineer")
# create Corpus and a document-term matrix from the titles
titleCorpus <- Corpus(VectorSource(titles))
titleDtm <- DocumentTermMatrix(titleCorpus)
# print out the document-term matrix
inspect(titleDtm)
# calculate edit distance between every word in workTitle and the column names (terms) of the document-term matrix
d <- apply(titleDtm, 1, function(x) {
  terms <- unlist(strsplit(as.character(workTitle), " "))
  adist(colnames(titleDtm), terms)
})
This is the output of the code above:
       Docs
         1  2  3  4
  [1,]  11 11 11 11
  [2,]   8  8  8  8
  [3,]   3  3  3  3
  [4,]   9  9  9  9
  [5,]  11 11 11 11
  [6,]  11 11 11 11
  [7,]  10 10 10 10
  [8,]  11 11 11 11
  [9,]   0  0  0  0
 [10,]   7  7  7  7
 [11,]   8  8  8  8
 [12,]   9  9  9  9
 [13,]   8  8  8  8
 [14,]   8  8  8  8
 [15,]   7  7  7  7
 [16,]   6  6  6  6
Answer 0 (score: 1)
If I understand you correctly, something like the following should work:
# Dictionary() was removed in tm 0.6; Terms() returns the terms as a character vector
terms <- Terms(titleDtm)
dat <- data.frame(adist(titles, terms), row.names = titles)
colnames(dat) <- terms
dat
The result is
                      engineer machinist mechanical operations pharmacy supervisor tech train
train machinist             12         6         11         12       11         14   12    10
operations supervisor       16        17         18         11       18         11   19    17
pharmacy tech               12        10         11         11        5         13    9    11
mechanical engineer         11        13          9         16       16         16   16    16
and then the row sums
data.frame(sum = rowSums(dat))
which gives the following output
                      sum
train machinist        88
operations supervisor 127
pharmacy tech          82
mechanical engineer   113
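Note that the sums above compare each whole title string against every term in the document-term matrix and never use workTitle. If the goal is the one stated in the question (one row per title, holding the sum of the word-by-word edit distances to workTitle), a minimal sketch in base R could look like the following. The names workTerms and wordDistSum are my own, and splitting on a single space is an assumption about how the titles are tokenized; tm is not needed for this part.

```r
# adist() is in the utils package, which is attached by default.
workTitle <- "biomechanical engineer"
titles <- c("train machinist", "operations supervisor",
            "pharmacy tech", "mechanical engineer")

# Split the reference title into its words (assumes single-space separators).
workTerms <- strsplit(workTitle, " ", fixed = TRUE)[[1]]

# For each title: split it into words, build the word-by-word edit distance
# matrix against workTerms, and add up all of its entries.
wordDistSum <- vapply(titles, function(title) {
  titleTerms <- strsplit(title, " ", fixed = TRUE)[[1]]
  sum(adist(workTerms, titleTerms))
}, numeric(1))

# One row per title, as requested in the question.
result <- data.frame(sum = wordDistSum, row.names = titles)
result
```

Using the per-word distances already printed in the question's output, this should give 32, 37, 36 and 22 for the four titles, so "mechanical engineer" comes out as the closest match. If a full 4-row matrix is wanted rather than a single sum, dropping the sum() call and returning the flattened adist() matrix per title would be one way to get it.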