Classifying PDF documents in R based on the presence/absence of specific words

Time: 2017-06-05 13:40:53

Tags: r text-mining

I would like to be able to import PDF documents into R and classify them as either:

  • Relevant (contains a specific string, e.g. "tacos", within the first 100 words)
  • Irrelevant (does not contain "tacos" within the first 100 words)

More specifically, I would like to address the following questions:

  1. Does a package exist in R to perform this basic classification?
  2. If so, is it possible to generate a dataset in R that looks like the one below, given that I have 2 PDF documents, where Paper1 contains at least one instance of the string "tacos" in its first 100 words and Paper2 does not:
     [image: example data frame with one row per document and its classification]

    Any references to documentation / R packages / example R code or mock examples for this kind of classification in R would be greatly appreciated! Thanks!

2 Answers:

Answer 0 (score: 4):

You can use the pdftools package and do something like this:

First, load the library and get some PDF file names:

library(pdftools)
fns <- list.files("~/Documents", pattern = "\\.pdf$", full = TRUE)
fns <- sample(fns, 5) # sample of 5 pdf filenames... 

Then define a function that reads a PDF file as text and looks at the first n words. (It can be useful to check for errors, e.g. an unknown password or the like; my example function returns NA in such cases.)

isRelevant <- function(fn, needle, n = 100L, ...) {
  res <- try({
    txt <- pdf_text(fn)  # extract the text, one string per page
    txt <- scan(text = txt, what = "character", quote = "", quiet = TRUE)  # split into words
    any(grepl(needle, txt[1:n], ...))  # does the needle occur in the first n words?
  }, silent = TRUE)
  if (inherits(res, "try-error")) NA else res  # e.g. unknown password -> NA
}
res <- sapply(fns, isRelevant, needle = "mail", ignore.case=TRUE)

Finally, wrap it up and put it into a data frame:

data.frame(
  Document = basename(fns), 
  Classification = dplyr::if_else(res, "relevant", "not relevant", "unknown")
)
#   Document  Classification
# 1    a.pdf        relevant
# 2    b.pdf    not relevant
# 3    c.pdf        relevant
# 4    d.pdf    not relevant
# 5    e.pdf        relevant
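
If you have no suitable PDFs at hand to try this on, here is a minimal sketch for generating throwaway test files with base R's pdf() graphics device (make_test_pdf is a hypothetical helper; text drawn this way is real text that pdf_text() can extract):

make_test_pdf <- function(path, words) {
  pdf(path)                                     # open a PDF graphics device
  plot.new()                                    # blank canvas to draw on
  text(0.5, 0.5, paste(words, collapse = " "))  # draw the words as extractable text
  dev.off()                                     # close the device and write the file
  path
}

fns <- c(
  make_test_pdf(tempfile(fileext = ".pdf"), c("we", "love", "tacos")),
  make_test_pdf(tempfile(fileext = ".pdf"), c("nothing", "to", "see", "here"))
)
sapply(fns, isRelevant, needle = "tacos", ignore.case = TRUE)
# expected: TRUE for the first file, FALSE for the second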

Answer 1 (score: 3):

Although @lukeA beat me to it, I wrote another small function that uses pdftools. The only real difference is that lukeA's answer looks at the first n characters, while my script looks at the first n words.

This is what my approach looks like:

library(pdftools)
library(dplyr) # for data_frames and bind_rows

# to find the files better
setwd("~/Desktop/pdftask/")

# list all files in the folder "pdfs"
pdf_files <- list.files("pdfs/", full.names = T)


# write a small function that takes a vector of paths to pdf-files, a search term,
# and a number of words (i.e., look at the first 100 words)
search_pdf <- function(pdf_files, search_term, n_words = 100) {
  # loop over the files 
  res_list <- lapply(pdf_files, function(file) {
    # use the library pdftools::pdf_text to extract the text from the pdf
    content <- pdf_text(file)

    # do some cleanup, i.e., remove punctuation, new-lines and lower all letters
    content2 <- tolower(content)
    content2 <- gsub("\\n", "", content2)
    content2 <- gsub("[[:punct:]]", "", content2)

    # split up the text by spaces
    content_vec <- strsplit(content2, " ")[[1]]

    # look if the search term is within the first n_words words
    found <- search_term %in% content_vec[1:n_words]

    # create a data_frame that holds our data
    res <- data_frame(file = file, 
                      relevance = ifelse(found, 
                                         "Relevant",
                                         "Irrelevant"))
    return(res)
  }) 

  # bind the data to a "tidy" data_frame
  res_df <- bind_rows(res_list)
  return(res_df)
}

search_pdf(pdf_files, search_term = "taco", n_words = 100)

# # A tibble: 3 × 2
#                          file  relevance
#                         <chr>      <chr>
# 1         pdfs//pdf_empty.pdf Irrelevant
# 2         pdfs//pdf_taco1.pdf   Relevant
# 3 pdfs//pdf_taco_above100.pdf Irrelevant
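
One caveat: pdf_text() returns one string per page, and strsplit(content2, " ")[[1]] keeps only the words of the first page, so a sparse title page could cut the search short; splitting on a single space also leaves empty tokens wherever the extraction produces runs of spaces. A hedged tweak to just that step (a sketch, assuming the rest of the function stays as above):

# join all pages before splitting, and split on runs of whitespace
content_vec <- strsplit(paste(content2, collapse = " "), "\\s+")[[1]]
content_vec <- content_vec[content_vec != ""]  # drop any empty tokens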