How to display document names (PDFs) in a TermDocumentMatrix

Time: 2017-05-09 09:05:26

Tags: r pdf

When I inspect my TermDocumentMatrix, the column headers are shown as numbers instead of the file names of the PDFs.

          Docs
Terms      10 11 2 3  4 5 6 7 8  9
  abil      1  2 0 0  0 0 0 0 0  1
  abl       0  1 0 0  6 0 1 0 0  0
  access    4  6 0 0  3 0 0 0 0  1
  accord    0  2 1 0  2 0 0 0 0  2
  account   3  2 0 0  0 0 1 0 0  1
  activ     5 18 2 5 14 1 3 2 2 10
  addit     3  1 2 0  0 1 2 0 3  2
  address   1  1 2 1  0 0 0 0 2  3
  adequ     0  2 0 0  2 2 0 0 0  1
  adequaci  1  0 0 0  1 1 0 0 2  2

These are the steps I have taken so far:

setwd("E:/OneDrive/Thesis/Received comments document/Consultation 14")
getwd()
library(pdftools)
library(tm)  # provides Corpus(), VectorSource() and TermDocumentMatrix()
# Collect the PDF file names and extract the text of each file
files <- list.files(pattern = "pdf$")
comments <- lapply(files, pdf_text)
corp <- Corpus(VectorSource(comments))
Comments.tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
    stopwords = TRUE,
    tolower = TRUE,
    stemming = TRUE,
    removeNumbers = TRUE,
    bounds = list(global = c(3, Inf))))

inspect(Comments.tdm[1:11,])

I tried to solve this by using:
meta(corp[[1]], tag = "id") <- files[1]

which returns the error message:

Error in `[.data.frame`(x$dmeta, tag) : undefined columns selected

How can I make sure the column headers show the file names of the PDFs?
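One possible approach, offered only as a minimal sketch and not confirmed anywhere in this thread, is to leave the document metadata alone and overwrite the column names of the finished matrix. The file names and texts below are hypothetical stand-ins for the output of `list.files()` and `pdf_text()`. Collapsing the pages of each PDF into one string is a choice made here so that each file becomes exactly one document. The sketch also assumes that the matrix columns follow the order of the corpus, and therefore the order of `files`.

```r
library(tm)

# Hypothetical stand-ins: in the question these come from
# list.files(pattern = "pdf$") and lapply(files, pdf_text).
files <- c("first.pdf", "second.pdf", "third.pdf")
comments <- list(
  c("access to the account", "additional activity"),  # a PDF with two pages
  "the account address",
  "adequate access and activity"
)

# Collapse the pages of each PDF into a single string, so that every
# element of the vector becomes exactly one document in the corpus.
texts <- vapply(comments, paste, character(1), collapse = " ")

corp <- Corpus(VectorSource(texts))
tdm <- TermDocumentMatrix(corp)

# Assumption: column i of the matrix corresponds to files[i].
# Check the count before renaming so a mismatch fails loudly.
stopifnot(ncol(tdm) == length(files))
colnames(tdm) <- files

inspect(tdm)
```

With real data it seems worth comparing a few term counts against a known PDF after renaming, since the labels are only as reliable as the ordering assumption.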

0 answers:

No answers yet