Question

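I have 100 scanned PDF files that I need to convert to text files.

I first converted them to png files (see the script below), and now I need help converting these 100 png files into 100 text files.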

library(pdftools)
library("tesseract")

#location
dest <- "P:\\TEST\\images to text"

#making loop for all files
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

#Convert files to png
sapply(myfiles, function(x)
  pdf_convert(x, format = "png", pages = NULL, 
              filenames = NULL, dpi = 600, opw = "", upw = "", verbose = TRUE))

#read files
cat(text)

I want one text file for each png file:

From: file1.png, file2.png, file3.png ...

To: file1.txt, file2.txt, file3.txt ...

But the actual result is a single text file containing the text of all the png files.

Answer 1

I guess you left out the png -> text part, but I assume you used library(tesseract).

You could do the following in your code:

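library(pdftools)
library(tesseract)

# create the OCR engine once and reuse it for every file
eng <- tesseract("eng")

# myfiles is the vector of PDF paths built in the question
sapply(myfiles, function(x) {
  # derive the output names from the PDF name; "$" anchors the match
  # so that only the file extension is replaced
  png_file <- sub("\\.pdf$", ".png", x)
  txt_file <- sub("\\.pdf$", ".txt", x)
  # note: pages = 1 renders only the first page of each PDF
  pdf_convert(x, format = "png", pages = 1,
              filenames = png_file, dpi = 600, verbose = TRUE)

  text <- ocr(png_file, engine = eng)
  cat(text, file = txt_file)
  ## just return the text string for convenience
  ## we are anyway more interested in the side effects
  text
})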
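
If the png files already exist and you only want the png -> txt step, something along these lines might work. It is a minimal sketch, not a tested solution: `png_to_txt_path()` and `ocr_png_folder()` are hypothetical helper names made up for this example, and it assumes all the png files sit in one folder that you pass in as `dir`. (As far as I recall, `pdf_convert()` with `filenames = NULL` writes its output to the current working directory, so check where your png files actually ended up.)

```r
# Hypothetical helper: map "file1.png" to "file1.txt" in the same folder.
png_to_txt_path <- function(png_file) {
  sub("\\.png$", ".txt", png_file, ignore.case = TRUE)
}

# Hypothetical helper: OCR every png in `dir` and write one text file per image.
# Assumes the tesseract package and the requested language data are installed.
ocr_png_folder <- function(dir, language = "eng") {
  engine <- tesseract::tesseract(language)
  png_files <- list.files(dir, pattern = "\\.png$", full.names = TRUE,
                          ignore.case = TRUE)
  for (png_file in png_files) {
    text <- tesseract::ocr(png_file, engine = engine)
    # each image gets its own output file, so nothing is merged
    writeLines(text, png_to_txt_path(png_file))
  }
  # return the paths of the text files that were written
  invisible(png_to_txt_path(png_files))
}

# Example call; replace the path with the folder that holds your png files:
# ocr_png_folder("P:\\TEST\\images to text")
```

Because the output file name is derived from each image name inside the loop, every png produces its own txt file rather than one combined file.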
