Question

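I have two document-term matrices. I can't take the union of the two because they have different lengths (each matrix has its own set of terms).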

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = F)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = F)


library(tm)

# A
doc_corpus <- Corpus(VectorSource(A$name))
control_list <- list(weighting=weightBin, removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
dtm <- DocumentTermMatrix(doc_corpus, control = control_list)
tf <- as.matrix(dtm)

# B
doc_corpus2 <- Corpus(VectorSource(B$name))
control_list2 <- list(weighting=weightBin, removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
dtm2 <- DocumentTermMatrix(doc_corpus2, control = control_list2)
tf2 <- as.matrix(dtm2)

tf[1,]

arteries      leg      ray    right     left shoulder     20km distance 
       1        1        1        1        0        0        0        0

tf2[4,]

arteries     left      leg      ray    right     xray 
       1        0        1        1        1        0

If I sum the element-wise product of these two vectors, it returns 3. It should be 4. To fix it, I tried:

sum(tf[1,][tf[1,]==1] * tf2[4,][tf2[4,]==1])

But this does not take the terms themselves into account. For example, comparing tf[1,] and tf2[1,]:

sum(tf[1,][tf[1,]==1] * tf2[1,][tf2[1,]==1])

It should be 3, but it returns 4.

I am doing the above in order to compute cosine similarity (see the formula below).

similarity = (sum(tf[1,] * tf2[4,])) / ( sqrt(sum(tf2[4,] ^ 2)) * sqrt(    sum(tf[1,] ^ 2)))

Answer 1

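Here is a way that is not only more straightforward but also preserves the full sparsity of the object. To compute the cosine similarity with the method in the question, you have to coerce a potentially very large document-term matrix into a dense matrix. The approach below avoids that.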

library("quanteda")
corp1 <- corpus(A, text_field = "name")
corp2 <- corpus(B, text_field = "name")
# unnecessary but better for distinguishing documents
docnames(corp1) <- paste("A", seq_len(ndoc(corp1)), sep = ".")
docnames(corp2) <- paste("B", seq_len(ndoc(corp2)), sep = ".")

Then build a document-feature matrix for each corpus and combine them: the rbind.dfm() method will magically join them and match up the features. (Note: an equivalent way to do this is to combine the corpus objects with the + operator: dtm2 <- dfm(corp1 + corp2). Try it!)

dtm3 <- rbind(dfm(corp1), dfm(corp2))
dtm3
# Document-feature matrix of: 8 documents, 10 features (65% sparse).
# 8 x 10 sparse Matrix of class "dfm"
#      features
# docs  x-ray right leg arteries left shoulder with 20km distance xray
#   A.1     1     1   1        1    0        0    0    0        0    0
#   A.2     1     0   0        0    1        1    0    0        0    0
#   A.3     1     0   1        1    0        0    0    0        0    0
#   A.4     1     0   1        0    0        0    1    1        1    0
#   B.1     1     0   1        1    1        0    0    0        0    0
#   B.2     1     0   1        0    0        0    0    0        0    0
#   B.3     0     1   1        0    0        0    0    0        0    1
#   B.4     1     1   1        1    0        0    0    0        0    0

The cosine similarity matrix can then be computed as a (sparse) dist class object.
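To illustrate what matching up the features means here, the following is a minimal base-R sketch. It is not how quanteda implements this. It pads two named term vectors to the union of their term names before multiplying them. The helper `align_terms` and the vectors `a1`, `b1`, and `b4` are hypothetical names, with values copied by hand from tf[1,], tf2[1,], and tf2[4,] in the question.

```r
# Hypothetical helper (an assumption for illustration, not from any package):
# pad a named numeric vector with zeros so it covers a common vocabulary.
align_terms <- function(x, vocab) {
  out <- setNames(numeric(length(vocab)), vocab)
  out[names(x)] <- x
  out
}

# Values copied by hand from the question's tf[1,], tf2[1,] and tf2[4,].
a1 <- c(arteries = 1, leg = 1, ray = 1, right = 1,
        left = 0, shoulder = 0, "20km" = 0, distance = 0)
b1 <- c(arteries = 1, left = 1, leg = 1, ray = 1, right = 0, xray = 0)
b4 <- c(arteries = 1, left = 0, leg = 1, ray = 1, right = 1, xray = 0)

# Union of all term names across the vectors being compared.
vocab <- Reduce(union, list(names(a1), names(b1), names(b4)))

# After alignment, position i refers to the same term in every vector,
# so the element-wise product counts only shared terms.
shared_a1_b4 <- sum(align_terms(a1, vocab) * align_terms(b4, vocab))  # 4
shared_a1_b1 <- sum(align_terms(a1, vocab) * align_terms(b1, vocab))  # 3
```

Because the vectors are matched by name rather than by position, no recycling takes place, and the counts agree with the values the question expects (4 and 3).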

Answer 2

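There is no need to use bind_rows(). You can combine two corpus objects or two dtm objects with c().

As stated in the documentation:

tm_combine: Combine several corpora into a single one, combine multiple documents into a corpus, combine multiple term-document matrices into a single one, or combine multiple term frequency vectors into a single term-document matrix.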
## S3 method for class 'VCorpus'
c(..., recursive = FALSE)
## S3 method for class 'TextDocument'
c(..., recursive = FALSE)
## S3 method for class 'TermDocumentMatrix'
c(..., recursive = FALSE)
## S3 method for class 'term_frequency'
c(..., recursive = FALSE)

Using your dtm and dtm2:

dtm3 <- c(dtm, dtm2)
as.matrix(dtm3)

    Terms
Docs arteries leg ray right left shoulder 20km distance xray
   1        1   1   1     1    0        0    0        0    0
   2        0   0   1     0    1        1    0        0    0
   3        1   1   1     0    0        0    0        0    0
   4        0   1   1     0    0        0    1        1    0
   1        1   1   1     0    1        0    0        0    0
   2        0   1   1     0    0        0    0        0    0
   3        0   1   0     1    0        0    0        0    1
   4        1   1   1     1    0        0    0        0    0
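Once every document shares the same columns, the cosine similarity formula from the question can be applied to any pair of rows. The sketch below uses base R only and assumes a dense matrix `m` typed in by hand from the output above. The row labels A.1 to B.4 are made up here to tell the two sources apart, and `cos_sim` is a hypothetical name.

```r
terms <- c("arteries", "leg", "ray", "right", "left",
           "shoulder", "20km", "distance", "xray")

# Combined matrix copied by hand from the as.matrix(dtm3) output;
# the row labels are an assumption added for readability.
m <- matrix(c(
  1, 1, 1, 1, 0, 0, 0, 0, 0,
  0, 0, 1, 0, 1, 1, 0, 0, 0,
  1, 1, 1, 0, 0, 0, 0, 0, 0,
  0, 1, 1, 0, 0, 0, 1, 1, 0,
  1, 1, 1, 0, 1, 0, 0, 0, 0,
  0, 1, 1, 0, 0, 0, 0, 0, 0,
  0, 1, 0, 1, 0, 0, 0, 0, 1,
  1, 1, 1, 1, 0, 0, 0, 0, 0
), nrow = 8, byrow = TRUE,
dimnames = list(c(paste0("A.", 1:4), paste0("B.", 1:4)), terms))

# Cosine similarity for all row pairs:
# dot products divided by the product of the row norms.
norms <- sqrt(rowSums(m^2))
cos_sim <- tcrossprod(m) / outer(norms, norms)

cos_sim["A.1", "B.4"]  # 1: identical term sets
cos_sim["A.1", "B.1"]  # 0.75: 3 shared terms, each row has 4 terms
```

This computes the same quantity as the question's `similarity` expression, but for all pairs at once. For large vocabularies a dense matrix like this may not fit in memory, in which case a sparse representation is preferable.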

Union of two term matrices in R