Classifying PDF documents in R based on the presence/absence of specific words

Time: 2017-06-05 13:40:53

Tags: r text-mining

I would like to be able to import PDF documents into R and classify them as either:

  • Relevant (contains a specific string, e.g. "tacos", within the first 100 words)
  • Irrelevant (does not contain "tacos" within the first 100 words)

More specifically, I would like to address the following questions:

  1. Does a package exist in R to perform this basic classification?
  2. If so, is it possible to generate a dataset in R that looks like the one below, given that I have 2 PDF documents, where Paper1 contains at least one instance of the string "tacos" in its first 100 words and Paper2 does not:
     [image: example data frame with one row per document and its classification]

    Any references to documentation / R packages / example R code or mock examples for this kind of classification in R would be greatly appreciated! Thanks!

2 Answers:

Answer 0 (score: 4):

You can use the pdftools package and do something like this:

First, load the library and get some PDF file names:

library(pdftools)
fns <- list.files("~/Documents", pattern = "\\.pdf$", full = TRUE)
fns <- sample(fns, 5) # sample of 5 pdf filenames... 

Then define a function that reads a PDF file as text and looks at the first n words. (It can be useful to check for errors, e.g. an unknown password or the like; my example function returns NA in such cases.)

isRelevant <- function(fn, needle, n = 100L, ...) {
  res <- try({
    txt <- pdf_text(fn)  # extract the text, one string per page
    txt <- scan(text = txt, what = "character", quote = "", quiet = TRUE)  # split into words
    any(grepl(needle, txt[1:n], ...))  # does the needle occur in the first n words?
  }, silent = TRUE)
  if (inherits(res, "try-error")) NA else res  # e.g. unknown password -> NA
}
res <- sapply(fns, isRelevant, needle = "mail", ignore.case=TRUE)

Finally, wrap it up and put it into a data frame:

data.frame(
  Document = basename(fns), 
  Classification = dplyr::if_else(res, "relevant", "not relevant", "unknown")
)
#   Document  Classification
# 1    a.pdf        relevant
# 2    b.pdf    not relevant
# 3    c.pdf        relevant
# 4    d.pdf    not relevant
# 5    e.pdf        relevant
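
If you have no suitable PDFs at hand to try this on, here is a minimal sketch for generating throwaway test files with base R's pdf() graphics device (make_test_pdf is a hypothetical helper; text drawn this way is real text that pdf_text() can extract):

make_test_pdf <- function(path, words) {
  pdf(path)                                     # open a PDF graphics device
  plot.new()                                    # blank canvas to draw on
  text(0.5, 0.5, paste(words, collapse = " "))  # draw the words as extractable text
  dev.off()                                     # close the device and write the file
  path
}

fns <- c(
  make_test_pdf(tempfile(fileext = ".pdf"), c("we", "love", "tacos")),
  make_test_pdf(tempfile(fileext = ".pdf"), c("nothing", "to", "see", "here"))
)
sapply(fns, isRelevant, needle = "tacos", ignore.case = TRUE)
# expected: TRUE for the first file, FALSE for the second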

Answer 1 (score: 3):

Although @lukeA beat me to it, I wrote another small function that uses pdftools. The only real difference is that lukeA's answer looks at the first n characters, while my script looks at the first n words.

This is what my approach looks like:

library(pdftools)
library(dplyr) # for data_frames and bind_rows

# to find the files better
setwd("~/Desktop/pdftask/")

# list all files in the folder "pdfs"
pdf_files <- list.files("pdfs/", full.names = T)


# write a small function that takes a vector of paths to pdf-files, a search term,
# and a number of words (i.e., look at the first 100 words)
search_pdf <- function(pdf_files, search_term, n_words = 100) {
  # loop over the files 
  res_list <- lapply(pdf_files, function(file) {
    # use the library pdftools::pdf_text to extract the text from the pdf
    content <- pdf_text(file)

    # do some cleanup, i.e., remove punctuation, new-lines and lower all letters
    content2 <- tolower(content)
    content2 <- gsub("\\n", "", content2)
    content2 <- gsub("[[:punct:]]", "", content2)

    # split up the text by spaces
    content_vec <- strsplit(content2, " ")[[1]]

    # look if the search term is within the first n_words words
    found <- search_term %in% content_vec[1:n_words]

    # create a data_frame that holds our data
    res <- data_frame(file = file, 
                      relevance = ifelse(found, 
                                         "Relevant",
                                         "Irrelevant"))
    return(res)
  }) 

  # bind the data to a "tidy" data_frame
  res_df <- bind_rows(res_list)
  return(res_df)
}

search_pdf(pdf_files, search_term = "taco", n_words = 100)

# # A tibble: 3 × 2
#                          file  relevance
#                         <chr>      <chr>
# 1         pdfs//pdf_empty.pdf Irrelevant
# 2         pdfs//pdf_taco1.pdf   Relevant
# 3 pdfs//pdf_taco_above100.pdf Irrelevant
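
One caveat: pdf_text() returns one string per page, and strsplit(content2, " ")[[1]] keeps only the words of the first page, so a sparse title page could cut the search short; splitting on a single space also leaves empty tokens wherever the extraction produces runs of spaces. A hedged tweak to just that step (a sketch, assuming the rest of the function stays as above):

# join all pages before splitting, and split on runs of whitespace
content_vec <- strsplit(paste(content2, collapse = " "), "\\s+")[[1]]
content_vec <- content_vec[content_vec != ""]  # drop any empty tokens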