Question

I have 3000 subfolders in one main folder, and each subfolder contains 2 PDFs. I wrote the following code to convert the PDFs to text files.

all.subfolders <- list.dirs("#path to main folder", full.names = TRUE)

sapply(all.subfolders[-1], function(x) {
  file <- list.files(x, full.names = TRUE)
  lapply(file, function(x) system(paste('"C:\\Program Files (x86)\\xpdfbin-win-3.03\\bin64\\pdftotext.exe"', paste0('"', x, '"')), wait = FALSE))
})

However, a few of the PDFs cannot be converted to text. How can I collect those files in a list or something similar? Please help.

Answer 1

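I don't have enough reputation to comment, so please excuse me for posting this as an answer even though it isn't really one. Your PDF files may be protected, so that text cannot be extracted from them. Open one of these documents in a PDF viewer and try to copy text out of it; if the file is protected, this will probably fail. If you have the right to extract and process the text, you could consider converting the files to images, e.g. with ImageMagick, and then applying OCR to the images, e.g. with tesseract. To get started, you can refer to the following script: https://gist.github.com/benmarwick/11333467.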
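As a rough illustration of the image + OCR route (this is not taken from the linked gist), here is a minimal sketch. It assumes ImageMagick 7 (the `magick` command, which needs Ghostscript to read PDFs) and `tesseract` are installed and on the PATH. `ocr_pdf()` and its arguments are hypothetical names, and only the first page is rendered to keep it short. By default it just returns the arguments it would pass, so you can inspect them before running anything.

```r
# Build (and optionally run) the two external commands for OCR-ing one PDF.
# Assumption: "magick" (ImageMagick 7) and "tesseract" are on the PATH.
# ocr_pdf() is a hypothetical helper used only for illustration.
ocr_pdf <- function(pdf, run = FALSE, density = 300) {
  base <- tools::file_path_sans_ext(pdf)
  png  <- paste0(base, ".png")

  # Step 1: rasterise the PDF; "[0]" selects the first page only.
  magick_args <- c("-density", density,
                   shQuote(paste0(pdf, "[0]")), shQuote(png))

  # Step 2: OCR the image; tesseract appends ".txt" to the output base name.
  tesseract_args <- c(shQuote(png), shQuote(base))

  if (run) {
    status <- system2("magick", magick_args)
    if (status == 0) status <- system2("tesseract", tesseract_args)
    return(status)
  }
  list(magick = magick_args, tesseract = tesseract_args)
}

# Inspect the commands without executing them
cmds <- ocr_pdf("a/b/report.pdf")
```

Calling `ocr_pdf(path, run = TRUE)` would execute both steps and return the exit status of the last command that ran.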

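In reply to your comment about how to identify the files that have not been converted, you can use the following approach. I hope this is what you were looking for.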

#retrieve all file paths
#note that you can use recursive = TRUE to avoid looping over directories yourself
allfiles <- list.files("C:/.../mydirectory", full.names = TRUE, recursive = TRUE)

#split filepaths into a set of pdf and txt files
#txt files will, of course, only be the files that have been converted
pdffiles <- allfiles[grep("\\.pdf$", allfiles)]
txtfiles <- allfiles[grep("\\.txt$", allfiles)]

#remove file ending (escape the dot and anchor at the end of the name)
pdffiles <- sub("\\.pdf$", "", pdffiles)
txtfiles <- sub("\\.txt$", "", txtfiles)

#check which files have not been converted
notconverted <- setdiff(pdffiles, txtfiles)

#if needed, file ending can be added again
#e.g. for copying the unconverted files into a separate directory or so
notconverted <- paste0(notconverted, ".pdf")
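
The same check can be wrapped in a function. The sketch below is one possible variant, not the only way to do it: `find_unconverted()` is a hypothetical helper name, and it additionally flags PDFs whose .txt file exists but is empty, on the assumption that a converter may write an empty file when it finds no extractable text. The demo directory and file names are made up for illustration.

```r
# find_unconverted() is a hypothetical helper: it lists PDFs under `dir`
# that have no matching .txt file, or whose .txt file is empty.
find_unconverted <- function(dir) {
  pdfs <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                     recursive = TRUE, ignore.case = TRUE)
  # Expected output path: same name with the extension swapped for .txt
  txts <- sub("\\.pdf$", ".txt", pdfs, ignore.case = TRUE)

  missing <- !file.exists(txts)
  # file.size() is NA for missing files, so emptiness only counts
  # when the .txt file actually exists
  empty <- !missing & file.size(txts) == 0

  reason <- ifelse(missing, "no txt", ifelse(empty, "empty txt", NA))
  data.frame(pdf = pdfs, reason = reason,
             stringsAsFactors = FALSE)[missing | empty, ]
}

# Demo in a temporary directory:
# a.pdf has a non-empty a.txt, b.pdf has an empty b.txt, c.pdf has no .txt
demo <- file.path(tempdir(), "pdf_demo")
dir.create(demo, showWarnings = FALSE)
file.create(file.path(demo, c("a.pdf", "b.pdf", "c.pdf", "b.txt")))
writeLines("some text", file.path(demo, "a.txt"))

result <- find_unconverted(demo)
print(result)
```

The `pdf` column of the result can be passed to `file.copy()` if you want to move the unconverted files into a separate directory.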

PDF to text file conversion
