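I have two document-term matrices. I can't take the union of the two because their rows have different lengths (each matrix has its own set of terms).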
A <- data.frame(name = c(
"X-ray right leg arteries",
"x-ray left shoulder",
"x-ray leg arteries",
"x-ray leg with 20km distance"
), stringsAsFactors = F)
B <- data.frame(name = c(
"X-ray left leg arteries",
"X-ray leg",
"xray right leg",
"X-ray right leg arteries"
), stringsAsFactors = F)
library(tm)
# A
doc_corpus <- Corpus(VectorSource(A$name))
control_list <- list(weighting=weightBin, removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
dtm <- DocumentTermMatrix(doc_corpus, control = control_list)
tf <- as.matrix(dtm)
# B
doc_corpus2 <- Corpus(VectorSource(B$name))
control_list2 <- list(weighting=weightBin, removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
dtm2 <- DocumentTermMatrix(doc_corpus2, control = control_list2)
tf2 <- as.matrix(dtm2)
tf[1,]
arteries leg ray right left shoulder 20km distance
       1   1   1     1    0        0    0        0
tf2[4,]
arteries left leg ray right xray
       1    0   1   1     1    0
If I sum the element-wise product of these two rows, it returns 3. It should be 4. To fix it, I tried:
sum(tf[1,][tf[1,]==1] * tf2[4,][tf2[4,]==1])
But this does not take the terms themselves into account. For example, comparing tf[1,] and tf2[1,]:
sum(tf[1,][tf[1,]==1] * tf2[1,][tf2[1,]==1])
It should be 3, but it returns 4.
I am doing the above calculation in order to compute cosine similarity (see the formula below).
similarity = (sum(tf[1,] * tf2[4,])) / ( sqrt(sum(tf2[4,] ^ 2)) * sqrt( sum(tf[1,] ^ 2)))
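Answer 0: (score: 1)
This approach is not only more direct, it also keeps the objects fully sparse. To compute cosine similarity the way the question does, you would have to coerce a potentially very large document-term matrix into a dense matrix. The approach below avoids that.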
library("quanteda")
corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")
# unnecessary but better for distinguishing documents
docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")
The rbind.dfm() method will then "magically" join them and match up the features. (Note: an equivalent way to do this is to combine the corpus objects with the + operator: dtm2 <- dfm(corp1 + corp2). Try it!)
dtm3 <- rbind(dfm(corp1), dfm(corp2))
dtm3
# Document-feature matrix of: 8 documents, 10 features (65% sparse).
# 8 x 10 sparse Matrix of class "dfm"
#      features
# docs  x-ray right leg arteries left shoulder with 20km distance xray
#   A.1     1     1   1        1    0        0    0    0        0    0
#   A.2     1     0   0        0    1        1    0    0        0    0
#   A.3     1     0   1        1    0        0    0    0        0    0
#   A.4     1     0   1        0    0        0    1    1        1    0
#   B.1     1     0   1        1    1        0    0    0        0    0
#   B.2     1     0   1        0    0        0    0    0        0    0
#   B.3     0     1   1        0    0        0    0    0        0    1
#   B.4     1     1   1        1    0        0    0    0        0    0
The (sparse) cosine similarity matrix can then be computed as a dist class object.
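To make the feature matching concrete, here is a minimal base-R sketch, not the quanteda implementation, that pads two named term vectors to the union of their vocabularies before multiplying them. The helper `pad_to` and the variable names are hypothetical, and the vectors are typed in by hand from the tf[1,] and tf2[4,] output in the question.

```r
# Binary term vectors copied by hand from the question's tf[1,] and tf2[4,]
a1 <- c(arteries = 1, leg = 1, ray = 1, right = 1,
        left = 0, shoulder = 0, "20km" = 0, distance = 0)
b4 <- c(arteries = 1, left = 0, leg = 1, ray = 1, right = 1, xray = 0)

# Hypothetical helper: expand a named vector to a given vocabulary,
# filling terms that the vector does not contain with 0
pad_to <- function(x, vocab) {
  out <- setNames(numeric(length(vocab)), vocab)
  out[names(x)] <- x
  out
}

# Shared vocabulary: every term that occurs in either vector
vocab <- union(names(a1), names(b4))
a1_full <- pad_to(a1, vocab)
b4_full <- pad_to(b4, vocab)

# Both vectors now have the same terms in the same order,
# so the element-wise product lines up term by term
shared <- sum(a1_full * b4_full)
shared
```

Once both vectors share the same columns in the same order, the dot product counts the four terms the two documents have in common, which is the value the question expected.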
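Answer 1: (score: 0)
There is no need to use bind_rows(). You can combine two corpus objects or two dtm objects with c(), as described in the documentation:
tm_combine: Combine several corpora into a single one, combine multiple documents into a corpus, combine multiple term-document matrices into a single one, or combine multiple term frequency vectors into a single term-document matrix.
## S3 method for class 'VCorpus'
c(..., recursive = FALSE)
## S3 method for class 'TextDocument'
c(..., recursive = FALSE)
## S3 method for class 'TermDocumentMatrix'
c(..., recursive = FALSE)
## S3 method for class 'term_frequency'
c(..., recursive = FALSE)
Using your dtm and dtm2: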
dtm3 <- c(dtm, dtm2)
as.matrix(dtm3)
    Terms
Docs arteries leg ray right left shoulder 20km distance xray
   1        1   1   1     1    0        0    0        0    0
   2        0   0   1     0    1        1    0        0    0
   3        1   1   1     0    0        0    0        0    0
   4        0   1   1     0    0        0    1        1    0
   1        1   1   1     0    1        0    0        0    0
   2        0   1   1     0    0        0    0        0    0
   3        0   1   0     1    0        0    0        0    1
   4        1   1   1     1    0        0    0        0    0
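Because every row of the combined matrix now has the same columns, the cosine formula from the question can be applied to all pairs of documents at once. The following is a dense base-R sketch for illustration only: the matrix is copied by hand from the output above, and the `A.`/`B.` row labels are an assumption added here to tell the two sets of documents apart.

```r
# Combined binary matrix, typed in by hand from the as.matrix(dtm3) output
terms <- c("arteries", "leg", "ray", "right", "left",
           "shoulder", "20km", "distance", "xray")
m <- matrix(c(
  1, 1, 1, 1, 0, 0, 0, 0, 0,
  0, 0, 1, 0, 1, 1, 0, 0, 0,
  1, 1, 1, 0, 0, 0, 0, 0, 0,
  0, 1, 1, 0, 0, 0, 1, 1, 0,
  1, 1, 1, 0, 1, 0, 0, 0, 0,
  0, 1, 1, 0, 0, 0, 0, 0, 0,
  0, 1, 0, 1, 0, 0, 0, 0, 1,
  1, 1, 1, 1, 0, 0, 0, 0, 0
), nrow = 8, byrow = TRUE,
dimnames = list(c(paste0("A.", 1:4), paste0("B.", 1:4)), terms))

# Cosine similarity for every pair of rows:
# dot products divided by the product of the row norms
norms <- sqrt(rowSums(m^2))
sim <- tcrossprod(m) / outer(norms, norms)

# Similarities between the documents of A (rows) and of B (columns)
cross <- sim[1:4, 5:8]
round(cross, 2)
```

Here `cross["A.1", "B.4"]` is 1, since both documents contain exactly the same four terms, and `cross["A.1", "B.1"]` is 0.75, since they share three of their four terms each.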