Error when running OCR on a PDF in R

Asked: 2017-09-20 07:36:24

Tags: r pdf ocr tesseract lapply

I am trying to run OCR on a PDF in R, and it gives me errors. After the code runs, an "i.txt" file is generated, but the errors still appear.

pdftoppm version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftoppm [options] <PDF-file> <PPM-root>
  -f <int>          : first page to print
  -l <int>          : last page to print
  -r <number>       : resolution, in DPI (default is 150)
  -mono             : generate a monochrome PBM file
  -gray             : generate a grayscale PGM file
  -freetype <string>: enable FreeType font rasterizer: yes, no
  -aa <string>      : enable font anti-aliasing: yes, no
  -aaVector <string>: enable vector anti-aliasing: yes, no
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -cfg <string>     : configuration file to use in place of .xpdfrc
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
convert.exe: unable to open image '*.ppm': Invalid argument @ error/blob.c/OpenBlob/3146.
convert.exe: no images defined `D:/PDF_OCR_File/test.pdf.tif' @ error/convert.c/ConvertImageCommand/3275.
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
[[1]]
[1] FALSE

Warning messages:
1: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe D:/PDF_OCR_File/test.pdf -f 1 -l 2 -r 600 ocrbook"' had status 99 
2: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe ",  :
  '"D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe D:/PDF_OCR_File/test.pdf -f 1 -l 2 -r 600 ocrbook"' execution failed with error code 99
3: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm D:/PDF_OCR_File/test.pdf.tif"' had status 1 
4: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm ",  :
  '"D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm D:/PDF_OCR_File/test.pdf.tif"' execution failed with error code 1
5: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe D:/PDF_OCR_File/test.pdf.tif D:/PDF_OCR_File/test.pdf -l eng"' had status 1 
6: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe ",  :
  '"D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe D:/PDF_OCR_File/test.pdf.tif D:/PDF_OCR_File/test.pdf -l eng"' execution failed with error code 1
7: In file.remove(paste0(i, ".tiff")) :
  cannot remove file 'D:/PDF_OCR_File/test.pdf.tiff', reason 'No such file or directory'

My setwd() is "D:/PDF_OCR_File".

This is the code that produces the errors:

dest <- "D:/PDF_OCR_File"
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})


myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)




lapply(myfiles, function(i){

  shell(shQuote(paste0("D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe ", i, " -f 1 -l 2 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tiff" ))
})

I don't know what went wrong or what mistake I made. Any suggestions would be helpful. Thanks.

1 answer:

Answer 0: (score: 0)

I bet you used this as your example code, right? I found a lot of problems with that code, along with some outdated syntax.

The solution I came up with is:

dest <- "C:\\users\\YOURNAME\\desktop"

# Strip spaces from the PDF file names so the shell commands do not break
files <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
sapply(files, FUN = function(a) {
  file.rename(from = a, to = paste0(dirname(a), "/", gsub(" ", "", basename(a))))
})

# PDF -> PPM: options first, then the PDF file, then the output root
files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "pdf", full.names = TRUE))
lapply(files, function(i) {
  shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 70 ", i, ".pdf", " ", i)))
})

# PPM -> TIF, then delete the intermediate PPM
myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "ppm", full.names = TRUE))
lapply(myppms, function(y) {
  shell(shQuote(paste0("magick ", y, ".ppm", " ", y, ".tif")))
  file.remove(paste0(y, ".ppm"))
})

# TIF -> text (tesseract appends .txt to the output base), then delete the TIF
mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "tif", full.names = TRUE))
lapply(mytiffs, function(z) {
  shell(shQuote(paste0("tesseract ", z, ".tif", " ", z)))
  file.remove(paste0(z, ".tif"))
})

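The first problem with the GitHub snippet is that the options are incomplete and in the wrong place, so CMD cannot make sense of them, which is why you get the help menu. In your command the -f, -l, and -r options come after the PDF path, whereas the usage message shows that options must come before the PDF file and the output root. "ocrbook" is the output file name (which is bad if you want multiple files), so you end up with a PPM, PNG, or whatever, named something like "ocrbook-000001.png". The problem with function(i) in that code block is that it looks for "originalpdfname.pdf.png" rather than the name the file was actually converted to, "ocrbook-000001". I fixed this by creating a function within the function that finds the PNG files and puts them into (z).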
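To make the argument order concrete, here is a minimal sketch that builds a pdftoppm command with the options ahead of the two positional arguments. The helper name `build_pdftoppm_cmd` and its defaults are made up for illustration; it only assembles the string and does not run anything.

```r
# Hypothetical helper (not part of any package): builds a pdftoppm command
# with options first, then <PDF-file> and <PPM-root>, matching the usage text.
build_pdftoppm_cmd <- function(pdf, first = 1, last = 10, dpi = 70,
                               exe = "pdftoppm") {
  # Use the PDF path without its extension as the output root, so every PDF
  # gets its own page images instead of sharing one "ocrbook" root.
  root <- tools::file_path_sans_ext(pdf)
  # type = "cmd" forces Windows-style double quotes on any platform.
  paste(exe, "-f", first, "-l", last, "-r", dpi,
        shQuote(pdf, type = "cmd"), shQuote(root, type = "cmd"))
}

pdftoppm_cmd <- build_pdftoppm_cmd("D:/PDF_OCR_File/test.pdf", last = 2, dpi = 600)
print(pdftoppm_cmd)
```

The resulting string could then be handed to shell() on Windows. Quoting each path separately is a design choice here, so that paths containing spaces would not need to be renamed first.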

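Tesseract [should] convert PNG files just fine, so there is no need to use ImageMagick to go from PPM to TIFF. Just use xPDF to convert the PDF to PNG. In any case, the ImageMagick syntax in the GitHub example is outdated: "convert" apparently conflicts with another CMD command, so it was changed to "magick" in later versions. See here. For consistency, I still used the converter from the example.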
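As a rough sketch of the newer syntax, the following builds one `magick input.ppm output.tif` command per page image. The helper name `ppm_to_tif_cmds` is hypothetical, and the example path is only an assumption about what pdftoppm would have written; nothing is executed.

```r
# Hypothetical helper: one "magick <in>.ppm <out>.tif" command per PPM file,
# using "magick" instead of "convert" to avoid the Windows name clash.
ppm_to_tif_cmds <- function(ppms) {
  tifs <- paste0(tools::file_path_sans_ext(ppms), ".tif")
  paste("magick", shQuote(ppms, type = "cmd"), shQuote(tifs, type = "cmd"))
}

magick_cmds <- ppm_to_tif_cmds("D:/PDF_OCR_File/test-000001.ppm")
print(magick_cmds)
```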

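Another problem with that code example is that tesseract defaults to English. That may be something introduced in newer versions, so there is no longer any need to specify "-l eng". See here. "out" is apparently the name of the exported txt file (judging purely from observation). You need to strip the path and use the name inside the function, so that the output mimics the original file name and the OCR result is not overwritten by a new file on every run.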
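One way to keep the text output tied to its source image is to derive tesseract's output base from the image name, as in this minimal sketch. The helper name `build_tesseract_cmd` and the example path are assumptions for illustration; the command is only built, not run.

```r
# Hypothetical helper: build "tesseract <image> <outputbase>".
# tesseract appends ".txt" to the output base itself, so the base is the
# image name without its extension; no "-l eng" since English is the default.
build_tesseract_cmd <- function(image, out_dir = dirname(image)) {
  out_base <- file.path(out_dir, tools::file_path_sans_ext(basename(image)))
  paste("tesseract", shQuote(image, type = "cmd"), shQuote(out_base, type = "cmd"))
}

tesseract_cmd <- build_tesseract_cmd("D:/PDF_OCR_File/test-000001.tif")
print(tesseract_cmd)
```

Because each image maps to its own output base, a second run on a different file would write a different .txt rather than overwriting the earlier one.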