I am trying to compute word frequencies across several PDF documents in R using the tm package. The way I am doing it now, I can only count exact words independently. I would like to count words taking their stems into account: for example, for the keyword "water", I want "water" and "waters" to be counted together. Here is my script so far.
library(NLP); library(SnowballC);library(tm); library(pdftools)
setwd("C:/Users/Guido/Dropbox/NBSAPs_ed/English")
# Grab the files ending with "pdf"
files <- list.files(pattern = "pdf$")
# Extract the text with pdf_text()
NBSAPs <- lapply(files, pdf_text)
# Create a corpus.
NBSAPs_corp <- Corpus(VectorSource(NBSAPs))
# Create the term-document matrix
NBSAPs_tdm <- TermDocumentMatrix(NBSAPs_corp, control = list(removePunctuation = TRUE,
tolower = TRUE,
removeNumbers = TRUE))
# Inspect the first 10 rows
inspect(NBSAPs_tdm[1:10,])
# Convert to a matrix
NBSAPs_table <- as.matrix(NBSAPs_tdm)
# Use the file names as column names
colnames(NBSAPs_table) <- files
# Table for keywords
keywords <- c("water")
final_NBSAPs_table <- NBSAPs_table[keywords, ]
row.names(final_NBSAPs_table)
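A possible way to get stem-aware counts, sketched below under the assumption that `NBSAPs_corp` is the corpus built above: tm's `TermDocumentMatrix` accepts a `stemming = TRUE` control option (backed by SnowballC), which collapses inflected forms such as "water" and "waters" onto a common stem. The rows of the stemmed matrix are then indexed by stems, so the keyword must be stemmed with `stemDocument` before lookup. The object names `NBSAPs_tdm_stemmed` and `stemmed_keywords` are illustrative, not from the original script.

```r
library(tm); library(SnowballC)

# Build the term-document matrix with stemming enabled,
# keeping the other preprocessing options from the original script
NBSAPs_tdm_stemmed <- TermDocumentMatrix(NBSAPs_corp,
                                         control = list(removePunctuation = TRUE,
                                                        tolower = TRUE,
                                                        removeNumbers = TRUE,
                                                        stemming = TRUE))
NBSAPs_table_stemmed <- as.matrix(NBSAPs_tdm_stemmed)
colnames(NBSAPs_table_stemmed) <- files

# Stem the keywords so they match the stemmed row names
# (e.g. "water" and "waters" both reduce to the stem "water")
keywords <- c("water")
stemmed_keywords <- stemDocument(keywords)

# drop = FALSE keeps a matrix even when there is a single keyword
final_NBSAPs_table <- NBSAPs_table_stemmed[stemmed_keywords, , drop = FALSE]
```

Note that each row of `final_NBSAPs_table` now aggregates every word form sharing that stem, which is the combined count the question asks for.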