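I would like to be able to import PDF documents into R and classify them as:
More specifically, I would like to address the following questions: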
Answer 0 (score: 4)
You can use the pdftools library and do the following:
First, load the library and get some PDF file names:
library(pdftools)
# full.names = TRUE returns complete paths rather than bare file names
fns <- list.files("~/Documents", pattern = "\\.pdf$", full.names = TRUE)
fns <- sample(fns, 5) # sample of 5 pdf filenames...
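Then define a function that reads a PDF file as text and searches its first n words. (Checking for errors, such as an unknown password or something similar, can be useful. My example function returns NA in those cases.)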
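isRelevant <- function(fn, needle, n = 100L, ...) {
  res <- try({
    txt <- pdf_text(fn)  # one string per page
    # split the text into whitespace-separated words
    txt <- scan(text = txt, what = "character", quote = "", quiet = TRUE)
    # head() avoids NA padding when the document has fewer than n words
    any(grepl(needle, head(txt, n), ...))
  }, silent = TRUE)
  # files that cannot be read (e.g. password-protected) yield NA
  if (inherits(res, "try-error")) NA else res
}
res <- sapply(fns, isRelevant, needle = "mail", ignore.case = TRUE)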
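To see what the word-splitting and error-handling steps do without a PDF at hand, here is a minimal sketch that runs the same logic on text already in memory. The names `check_first_words`, `good_reader` and `bad_reader` are hypothetical and exist only for this illustration; `good_reader` stands in for a successful `pdf_text()` call and `bad_reader` for one that fails.

```r
# Hypothetical helper mirroring isRelevant(): `reader` stands in for pdf_text(fn)
check_first_words <- function(reader, needle, n = 100L, ...) {
  res <- try({
    txt <- reader()
    # scan() splits on whitespace, giving one element per word
    words <- scan(text = txt, what = "character", quote = "", quiet = TRUE)
    any(grepl(needle, head(words, n), ...))
  }, silent = TRUE)
  # any error raised while reading or searching becomes NA
  if (inherits(res, "try-error")) NA else res
}

# Made-up stand-ins for a readable and an unreadable document
good_reader <- function() c("Please send an E-Mail to the office.",
                            "Page two mentions tacos.")
bad_reader <- function() stop("password required")

hit   <- check_first_words(good_reader, "mail", ignore.case = TRUE)
miss  <- check_first_words(good_reader, "tacos", n = 5L)
error <- check_first_words(bad_reader, "mail")
print(c(hit = hit, miss = miss, error = error))
```

Here `hit` is TRUE because "E-Mail" contains the pattern, `miss` is FALSE because "tacos" lies beyond the first five words, and `error` is NA because the reader failed.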
Finally, wrap it up and put the results into a data frame:
data.frame(
  Document = basename(fns),
  # NA results (files that could not be read) are labelled "unknown"
  Classification = dplyr::if_else(res, "relevant", "not relevant", missing = "unknown")
)
# Document Classification
# 1 a.pdf relevant
# 2 b.pdf not relevant
# 3 c.pdf relevant
# 4 d.pdf not relevant
# 5 e.pdf relevant
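If you would rather not depend on dplyr for this single call, the same three-way labelling can be sketched in base R. The vector `res_demo` below is made-up data standing in for the output of the `sapply()` call. It is not taken from real files.

```r
# Made-up results: TRUE = match, FALSE = no match, NA = file could not be read
res_demo <- c(a.pdf = TRUE, b.pdf = FALSE, c.pdf = NA)

# ifelse() leaves NA as NA, so the unreadable files are labelled afterwards
labels <- ifelse(res_demo, "relevant", "not relevant")
labels[is.na(labels)] <- "unknown"

classified <- data.frame(
  Document = names(res_demo),
  Classification = unname(labels)
)
print(classified)
```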
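Answer 1 (score: 3)
Although @lukeA beat me to it, I wrote another small function that uses pdftools. The only real difference is that lukeA looks at the first n characters, whereas my script looks at the first n words.
This is what my approach looks like: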
library(pdftools)
library(dplyr) # for tibble() and bind_rows()
# set the working directory so the files are easier to find
setwd("~/Desktop/pdftask/")
# list all files in the folder "pdfs"
pdf_files <- list.files("pdfs/", full.names = TRUE)
# write a small function that takes a vector of paths to pdf files, a search term,
# and a number of words (i.e., look at the first 100 words)
search_pdf <- function(pdf_files, search_term, n_words = 100) {
  # loop over the files
  res_list <- lapply(pdf_files, function(file) {
    # pdftools::pdf_text() returns one string per page; join the pages
    content <- paste(pdf_text(file), collapse = " ")
    # clean up: lower-case all letters and remove punctuation
    content2 <- tolower(content)
    content2 <- gsub("[[:punct:]]", "", content2)
    # split on any run of whitespace, so newlines also separate words
    content_vec <- strsplit(trimws(content2), "\\s+")[[1]]
    # check whether the search term is among the first n_words words
    found <- tolower(search_term) %in% head(content_vec, n_words)
    # create a tibble that holds our data
    res <- tibble(file = file,
                  relevance = ifelse(found, "Relevant", "Irrelevant"))
    return(res)
  })
  # bind the results into one "tidy" tibble
  res_df <- bind_rows(res_list)
  return(res_df)
}
search_pdf(pdf_files, search_term = "taco", n_words = 100)
# # A tibble: 3 × 2
# file relevance
# <chr> <chr>
# 1 pdfs//pdf_empty.pdf Irrelevant
# 2 pdfs//pdf_taco1.pdf Relevant
# 3 pdfs//pdf_taco_above100.pdf Irrelevant
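One practical difference between the two answers is how a word counts as a match: `isRelevant()` uses `grepl()`, which finds the pattern anywhere inside a word, while `search_pdf()` uses `%in%`, which requires a whole word to equal the search term. A small sketch on a made-up word vector (`words` is illustrative, not extracted from a real PDF) shows the effect:

```r
# Made-up, already cleaned word vector
words <- c("please", "send", "an", "email", "about", "tacos")

partial_hit  <- any(grepl("mail", words))    # TRUE: "mail" occurs inside "email"
exact_hit    <- "mail" %in% words            # FALSE: no word is exactly "mail"
anchored_hit <- any(grepl("^mail$", words))  # FALSE: anchors force a whole-word match
taco_hit     <- "tacos" %in% words           # TRUE: exact word present

print(c(partial = partial_hit, exact = exact_hit,
        anchored = anchored_hit, tacos = taco_hit))
```

Which behaviour is preferable depends on the search term: substring matching catches variants such as "email" and "mails", whereas exact matching avoids accidental hits inside longer words.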