Question

I am trying to build a term document matrix from the text of a PDF. When I inspect the term document matrix, I get this:

<<TermDocumentMatrix (terms: 7245, documents:342)>>

The number of documents should be 1, not 342; 342 is the number of pages in the PDF file. I am using R.

This is the code I used:

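library(pdftools)  # provides pdf_text()
library(tm)        # provides Corpus(), VectorSource(), TermDocumentMatrix(), inspect()

pdf_file <- file.path("Lat/web", "textpdf.pdf")
text <- pdf_text(pdf_file)  # character vector with one element per page
myCorpus <- Corpus(VectorSource(text))

# stopwords_en is not defined in this snippet
mytdm <- TermDocumentMatrix(myCorpus, control = list(
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords_en,
  stemming = TRUE
))
inspect(mytdm)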

Answer 1

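pdf_text() returns one string per page, and VectorSource() treats each element of the vector as a separate document. Use the following code to collapse the PDF pages into a single document.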

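pdf_file <- file.path("Lat/web", "textpdf.pdf")
text <- pdf_text(pdf_file)
# collapse the PDF pages into one string; the space separator keeps
# words at page boundaries from being fused together
text <- paste(text, collapse = " ")
# ..... rest of the code is unchanged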
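
To see the effect without a PDF at hand, here is a minimal sketch that assumes the tm package is installed. The `pages` vector is a made-up stand-in for what pdf_text() returns (one string per page), and the variable names are only for illustration.

```r
library(tm)

# Hypothetical stand-in for pdf_text() output: one string per page.
pages <- c("First page of the report.\n",
           "Second page with more text.\n",
           "Third page ends here.\n")

# One vector element per page, so tm builds one document per page.
per_page <- TermDocumentMatrix(VCorpus(VectorSource(pages)))

# Collapse the pages into a single string first, so tm builds one document.
whole  <- paste(pages, collapse = " ")
single <- TermDocumentMatrix(VCorpus(VectorSource(whole)),
                             control = list(removePunctuation = TRUE))

nDocs(per_page)  # 3
nDocs(single)    # 1
```

The same reasoning applies to the 342-page file: the document count follows the length of the vector passed to VectorSource(), so collapsing before building the corpus should give documents: 1.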

Building a term document matrix from a PDF file
