Question

I am trying to find words that occur in multiple documents at the same time.

Let us take an example.

doc1: "this is a document about milkyway"
doc2: "milky way is huge"

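As you can see in the 2 documents above, the word "milkyway" occurs in both docs, but in the second document the term is separated by a space ("milky way") and in the first doc it is not.

I am doing the following to get the term-document matrix in R.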

library(tm)
# the two example documents from above
doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus, control = list(tolower = TRUE, removeNumbers = TRUE, removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df

         1 2
document 1 0
huge     0 1
milky    0 1
milkyway 1 0
way      0 1

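According to the matrix above, the term milkyway appears only in the first document.

I want to be able to get 1 for the term "milkyway" in both documents in the matrix above. This is just an example; I need to do this for a lot of documents. Ultimately, I want to be able to treat such words ("milkyway" & "milky way") in the same way.

Edit 1:

Can't I force the term-document matrix to be computed so that, for whatever word it looks for, it does not only look for that word as a separate word in the string but also within strings? For example, one term is milky and there is a document this is milkyway. Currently milky does not occur in this document, but if the algorithm looked for the word within strings, it would also find milky inside the string milkyway. That way the words milky and way would be counted in both of my documents (from the earlier example).

Edit 2:

Ultimately I want to be able to compute the cosine similarity index between documents.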

Answer 1

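You first need to convert the documents to a bag-of-primitive-words representation, where each primitive word is matched with a set of words. The primitive word itself can also be in the corpus.

For instance: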

milkyway -> {milky, milky way, milkyway} 
economy -> {economics, economy}
sport -> {soccer, football, basket ball, basket, NFL, NBA}

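You can build such a dictionary before computing the cosine distance, using a synonym dictionary together with an edit distance (such as Levenshtein) to complement the synonym dictionary.

Computing the 'sport' key is more involved.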
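A minimal sketch of the dictionary idea in base R, assuming a hand-built lookup table. The names `primitives` and `to_primitives` are hypothetical and only serve the illustration; `adist` from the utils package is the base R function for generalized Levenshtein distance and could help find candidate variants.

```r
# Hypothetical lookup table: each primitive word maps to the variants it absorbs.
primitives <- list(
  milkyway = c("milky", "milky way", "milkyway"),
  economy  = c("economics", "economy")
)

# Replace every variant in the texts with its primitive word.
# Longer variants are handled first so "milky way" is replaced before "milky".
# Assumes the variants contain no regex metacharacters.
to_primitives <- function(texts, dict) {
  for (key in names(dict)) {
    variants <- dict[[key]]
    variants <- variants[order(nchar(variants), decreasing = TRUE)]
    for (v in variants) {
      # \\b restricts the match to whole words
      texts <- gsub(paste0("\\b", v, "\\b"), key, texts,
                    ignore.case = TRUE, perl = TRUE)
    }
  }
  texts
}

docs <- c("this is a document about milkyway", "milky way is huge")
normalized <- to_primitives(docs, primitives)
normalized

# Edit distance between the two spellings (one inserted space)
utils::adist("milkyway", "milky way")
```

After this step both documents contain the single token milkyway, so any term-document matrix built from `normalized` counts it in both.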

Answer 2

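You can use a regular expression to match every possible split of a word by inserting "\\s?" between every character of your search terms. If you only want specific splits, just insert it at those positions. The following code generates a regex pattern for each search term by inserting "\\s?" between each character. grep returns the indices of the documents where the pattern matches, but it can be swapped for other regex functions.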

docs <- c("this is a document about milkyway",  "milky way is huge")
search_terms <- c("milkyway", "document")
pattern_fix <- sapply(strsplit(search_terms, split = NULL), paste0, collapse = "\\s?")
sapply(pattern_fix, grep, docs)

$`m\\s?i\\s?l\\s?k\\s?y\\s?w\\s?a\\s?y`
[1] 1 2

$`d\\s?o\\s?c\\s?u\\s?m\\s?e\\s?n\\s?t`
[1] 1
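For the "specific splits" case mentioned above, something like the following could work. This is only a sketch: `split_pattern` and its `after` argument are hypothetical names, and the third document is made up to show a split that should not match.

```r
# Hypothetical helper: allow optional whitespace only after the given character positions.
split_pattern <- function(term, after) {
  chars <- strsplit(term, split = NULL)[[1]]
  # append the optional-whitespace token to the chosen characters only
  chars[after] <- paste0(chars[after], "\\s?")
  paste0(chars, collapse = "")
}

docs <- c("this is a document about milkyway",
          "milky way is huge",
          "mil kyway is a typo")

# allow a split only between "milky" and "way"
pat <- split_pattern("milkyway", after = 5)
matches <- grep(pat, docs)
matches
```

Restricting the split positions cuts down on accidental matches compared with allowing whitespace between every pair of characters.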

Edit

To search for all words, you can use the row names of tmp.df from your script as the search_terms in my solution.

doc1 <- "this is a document about milkyway"
doc2 <- "milky way is huge"
library(tm)
tmp.text <- data.frame(rbind(doc1, doc2))
tmp.corpus <- Corpus(DataframeSource(tmp.text))
tmpDTM <- TermDocumentMatrix(tmp.corpus, control = list(tolower = T, removeNumbers = T, removePunctuation = TRUE, stopwords = TRUE, wordLengths = c(2, Inf)))
tmp.df <- as.data.frame(as.matrix(tmpDTM))
tmp.df

search_terms <- row.names(tmp.df)
pattern_fix <- sapply(strsplit(search_terms, split = NULL), paste0, collapse = "\\s?")
names(pattern_fix) <- search_terms
word_count <- sapply(pattern_fix, grep, tmp.text[[1]])
h_table <- sapply(word_count, function(x) table(factor(x, levels = 1:nrow(tmp.text)))) # horizontal table
v_table <- t(h_table) # vertical table (like tmp.df)
v_table

         1 2
document 1 0
huge     0 1
milky    1 1
milkyway 1 1
way      1 1

Answer 3

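Here is a solution that requires no pre-set list of words. It performs the separation by tokenizing the text into bigrams, removing the separator between the adjacent words, and then looking for matches in the unigram tokenization. The matches are saved and then replaced in the texts with their separated versions.

This means that no pre-set list is required, but only those unparsed forms that have an equivalent parsed version somewhere in the text are handled. Note that this can generate false positives such as "berated" and "be rated", which may not be occurrences of the same pair: the first is a valid unigram that is distinct from the concatenated bigram of the second. (No perfect solution exists to this particular problem.)

This solution requires the quanteda package for text analysis and the stringi package for the vectorized regex replacement.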

# original example
myTexts <- c(doc1 = "this is a document about milkyway", doc2 = "milky way is huge")

require(quanteda) 

unparseMatches <- function(texts) {
    # tokenize all texts
    toks <- quanteda::tokenize(toLower(texts), simplify = TRUE)
    # tokenize bigrams
    toks2 <- quanteda::ngrams(toks, 2, concatenator = " ")
    # find out which compressed pairs exist already compressed in original tokens
    compoundTokens <- toks2[which(gsub(" ", "", toks2) %in% toks)]
    # vectorized replacement and return
    result <- stringi::stri_replace_all_fixed(texts, gsub(" ", "", compoundTokens), compoundTokens, vectorize_all = FALSE)
    # because stringi strips names
    names(result) <- names(texts)
    result
}

unparseMatches(myTexts)
##                                 doc1                                 doc2 
##  "this is a document about milky way"                 "milky way is huge" 
quanteda::dfm(unparseMatches(myTexts), verbose = FALSE)
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfmSparse"
##       features
## docs   this is a document about milky way huge
##   doc1    1  1 1        1     1     1   1    0
##   doc2    0  1 0        0     0     1   1    1


# another test, with two sets of phrases that need to be unparsed 
testText2 <- c(doc3 = "This is a super duper data set about the milky way.",
               doc4 = "And here is another superduper dataset about the milkyway.")
unparseMatches(testText2)
##                                                            doc3                                                            doc4 
##           "This is a super duper data set about the milky way." "And here is another super duper data set about the milky way." 
(myDfm <- dfm(unparseMatches(testText2), verbose = FALSE))
## Document-feature matrix of: 2 documents, 14 features.
## 2 x 14 sparse Matrix of class "dfmSparse"
##       features
## docs   this is a super duper data set about the milky way and here another
##   doc3    1  1 1     1     1    1   1     1   1     1   1   0    0       0
##   doc4    0  1 0     1     1    1   1     1   1     1   1   1    1       1

quanteda can also do similarity computations such as cosine distance:

quanteda::similarity(myDfm, "doc3", margin = "documents", method = "cosine")
##      doc4   <NA> 
##    0.7833     NA

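I'm not sure what the NA is about - it seems to be a bug in the output when there is only one document to compare against in a two-document set. (I will fix this soon, but the result is still correct.)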
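As a sanity check, the cosine value can be reproduced by hand in base R from the document-feature matrix printed above. This is a sketch that retypes the two rows of that matrix; the `cosine` helper is a hypothetical name, not part of quanteda.

```r
# The two document vectors copied from the dfm printed above (1 = term present).
terms <- c("this", "is", "a", "super", "duper", "data", "set", "about",
           "the", "milky", "way", "and", "here", "another")
doc3 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
doc4 <- c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
names(doc3) <- names(doc4) <- terms

# Cosine similarity: dot product divided by the product of the vector lengths.
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# 9 shared terms, 11 terms in doc3, 12 in doc4: 9 / sqrt(11 * 12)
round(cosine(doc3, doc4), 4)
## [1] 0.7833
```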

Answer 4

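As Ken already said:

(No perfect solution exists to this particular problem.)

As far as I know this is absolutely correct, and it is supported by many textbooks and journals on text mining - usually within the first few paragraphs.

In my research I rely on already prepared datasets such as the „Deutscher Wortschatz“ project. They have already done the hard work and provide high-quality lists of synonyms, antonyms, polysemous terms and so on. Among other things, this project provides interface access via SOAP. A comparable database for the English language is WordNet, for example.

If you do not want to use a precomputed set, or cannot afford one, I recommend going with amirouche's approach and the primitive-word representations. Building them by hand is tedious and labor-intensive, but it is the most feasible approach.

Every other method that comes to my mind is definitely more complex. Just look at the other answers, or at state-of-the-art approaches in "Text Mining, Wissensrohstoff Text" by G. Heyer, U. Quasthoff and T. Wittig, such as clustering word forms by (1) identifying characteristic features (index terms), (2) creating a term-sentence matrix and choosing a weighting to compute the term-term matrix, (3) choosing a similarity measure and running it on your term-term matrix, and finally (4) choosing and running a clustering algorithm.

I would suggest you mark amirouche's post as the correct answer, since it is by far the best and most practicable way to do it (that I know of).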
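To give an idea of what those four steps involve, here is a rough base R sketch. It is my own toy illustration, not the method from the book: the sentences are made up, the weighting is plain binary co-occurrence, and cosine similarity with average-linkage clustering is just one possible choice.

```r
# Toy sentences (hypothetical data, for illustration only).
sentences <- c("the milky way is huge",
               "the milkyway is huge",
               "the economy is weak",
               "economics is hard")

# (1) index terms: lowercase whitespace tokens
tokens <- strsplit(tolower(sentences), "\\s+")
terms  <- sort(unique(unlist(tokens)))

# (2) binary term-sentence matrix, then term-term co-occurrence counts
ts <- sapply(tokens, function(tok) as.numeric(terms %in% tok))
rownames(ts) <- terms
tt <- ts %*% t(ts)

# (3) cosine similarity between the rows of the term-term matrix
norms <- sqrt(rowSums(tt^2))
sim   <- (tt %*% t(tt)) / (norms %o% norms)

# (4) hierarchical clustering on cosine distance (clamped at 0 against rounding noise)
d        <- as.dist(pmax(1 - sim, 0))
clusters <- cutree(hclust(d, method = "average"), k = 3)
split(names(clusters), clusters)
```

Even on this toy input there are several choices to tune (weighting, similarity measure, number of clusters), which is why I consider the dictionary approach more practical.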

Treat words separated by a space in the same manner
