I am trying to run OCR on a PDF in R, but it gives me errors. After running the code, the "i.txt" file is generated, but the errors still appear.
pdftoppm version 4.00
Copyright 1996-2017 Glyph & Cog, LLC
Usage: pdftoppm [options] <PDF-file> <PPM-root>
-f <int> : first page to print
-l <int> : last page to print
-r <number> : resolution, in DPI (default is 150)
-mono : generate a monochrome PBM file
-gray : generate a grayscale PGM file
-freetype <string>: enable FreeType font rasterizer: yes, no
-aa <string> : enable font anti-aliasing: yes, no
-aaVector <string>: enable vector anti-aliasing: yes, no
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-q : don't print any messages or errors
-cfg <string> : configuration file to use in place of .xpdfrc
-v : print copyright and version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
convert.exe: unable to open image '*.ppm': Invalid argument @ error/blob.c/OpenBlob/3146.
convert.exe: no images defined `D:/PDF_OCR_File/test.pdf.tif' @ error/convert.c/ConvertImageCommand/3275.
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
[[1]]
[1] FALSE
Warning messages:
1: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe D:/PDF_OCR_File/test.pdf -f 1 -l 2 -r 600 ocrbook"' had status 99
2: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe ", :
'"D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe D:/PDF_OCR_File/test.pdf -f 1 -l 2 -r 600 ocrbook"' execution failed with error code 99
3: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm D:/PDF_OCR_File/test.pdf.tif"' had status 1
4: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm ", :
'"D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm D:/PDF_OCR_File/test.pdf.tif"' execution failed with error code 1
5: running command 'C:\Windows\system32\cmd.exe /c "D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe D:/PDF_OCR_File/test.pdf.tif D:/PDF_OCR_File/test.pdf -l eng"' had status 1
6: In shell(shQuote(paste0("D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe ", :
'"D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe D:/PDF_OCR_File/test.pdf.tif D:/PDF_OCR_File/test.pdf -l eng"' execution failed with error code 1
7: In file.remove(paste0(i, ".tiff")) :
cannot remove file 'D:/PDF_OCR_File/test.pdf.tiff', reason 'No such file or directory'
My setwd() is "D:/PDF_OCR_File".
This is the code that produces the errors:
dest <- "D:/PDF_OCR_File"
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
sapply(myfiles, FUN = function(i){
file.rename(from = i, to = paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
lapply(myfiles, function(i){
shell(shQuote(paste0("D:/Software_for_PDF_OCR/xpdf-tools-win-4.00/bin64/pdftoppm.exe ", i, " -f 1 -l 2 -r 600 ocrbook")))
# convert ppm to tif ready for tesseract
shell(shQuote(paste0("D:/Software_for_PDF_OCR/ImageMagick-7.0.7-Q16/convert.exe *.ppm ", i, ".tif")))
# convert tif to text file
shell(shQuote(paste0("D:/Software_for_PDF_OCR/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
# delete tif file
file.remove(paste0(i, ".tiff" ))
})
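I don't know what went wrong or what mistake I made. Any suggestions would be helpful. Thanks.
Answer 0 (score: 0)
I bet you used this as your code example, right? I found a lot of problems with that code, along with some outdated syntax.
The solution I came up with is: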
dest <- "C:\\users\\YOURNAME\\desktop"
# Strip spaces from the PDF file names so the shell commands do not split on them
files <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)
sapply(files, FUN = function(a){
  file.rename(from = a, to = paste0(dirname(a), "/", gsub(" ", "", basename(a))))
})
# PDF -> PPM: options first, then the PDF, then the output root (the PDF path without its extension)
files <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE))
lapply(files, function(i){
  shell(shQuote(paste0("pdftoppm -f 1 -l 10 -r 70 ", i,".pdf", " ",i)))
})
# PPM -> TIF with ImageMagick, then delete the intermediate PPM
myppms <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.ppm$", full.names = TRUE))
lapply(myppms, function(y){
  shell(shQuote(paste0("magick ", y,".ppm"," ",y,".tif")))
  file.remove(paste0(y,".ppm"))
})
# TIF -> TXT with Tesseract (one .txt per page), then delete the intermediate TIF
mytiffs <- tools::file_path_sans_ext(list.files(path = dest, pattern = "\\.tif$", full.names = TRUE))
lapply(mytiffs, function(z){
  shell(shQuote(paste0("tesseract ", z,".tif", " ",z)))
  file.remove(paste0(z,".tif"))
})
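The first problem with the GitHub snippet is that the options are incomplete and in the wrong place for CMD to understand them, which is why you get the help menu: the usage line expects the options first, then the PDF file, then the output root. "ocrbook" is the output file name (which is bad if you want multiple files), so you end up with a PPM, PNG, or whatever file named "ocrbook-000001.png". The problem with function(i) in that code block is that it looks for "originalpdfname.pdf.png" rather than the file name that was actually produced, "ocrbook-000001". I fixed it by creating a function within the function that finds the PNG files and puts them into (z).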
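The argument order and the output naming described above can be sketched as two small helpers. This is only an illustration: `build_pdftoppm_cmd` and `expected_ppm` are hypothetical names, not part of xpdf or R, and the six-digit page suffix is an assumption inferred from the "ocrbook-000001" name mentioned above.

```r
# Hypothetical helper: build a pdftoppm call that follows the usage line
#   pdftoppm [options] <PDF-file> <PPM-root>
# Options come first; the output root is the PDF path without its extension,
# so each PDF gets its own set of page images instead of a shared "ocrbook".
build_pdftoppm_cmd <- function(pdf, first = 1, last = 2, dpi = 600,
                               exe = "pdftoppm") {
  root <- tools::file_path_sans_ext(pdf)
  paste(exe, "-f", first, "-l", last, "-r", dpi,
        shQuote(pdf, type = "cmd"), shQuote(root, type = "cmd"))
}

# Hypothetical helper: the page files we expect pdftoppm to write.
# Assumption: pages are numbered with six digits, as in "ocrbook-000001".
expected_ppm <- function(pdf, pages) {
  sprintf("%s-%06d.ppm", tools::file_path_sans_ext(pdf), pages)
}

cmd <- build_pdftoppm_cmd("D:/PDF_OCR_File/test.pdf")
ppms <- expected_ppm("D:/PDF_OCR_File/test.pdf", 1:2)
```

On Windows the resulting string could then be passed to `shell(cmd)`. Quoting each path separately, rather than the whole command, also keeps paths containing spaces intact.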
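Tesseract [should] handle PNG files just fine, so there is no need to use ImageMagick to convert from PPM to TIFF. Just use xPDF to convert the PDF to PNG. In any case, the ImageMagick syntax in the GitHub example is outdated: "convert" apparently conflicts with another CMD command, so it was changed to "magick" in later versions. See here. For consistency, I still used the converter from the example.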
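A minimal sketch of the PNG route mentioned above, skipping ImageMagick entirely. It assumes the `pdftopng` tool from the same xpdf-tools package is on the PATH and accepts the same `-f`, `-l`, and `-r` options as `pdftoppm`; the helper names `build_pdftopng_cmd` and `find_page_images` are hypothetical.

```r
# Hypothetical helper: build a pdftopng call (assumed to mirror pdftoppm's
# usage: options, then the PDF, then the output root).
build_pdftopng_cmd <- function(pdf, first = 1, last = 10, dpi = 300,
                               exe = "pdftopng") {
  root <- tools::file_path_sans_ext(pdf)
  paste(exe, "-f", first, "-l", last, "-r", dpi,
        shQuote(pdf, type = "cmd"), shQuote(root, type = "cmd"))
}

# Hypothetical helper: collect the page images that belong to one PDF,
# i.e. the PNG files in the same folder whose names start with "<pdf name>-".
find_page_images <- function(pdf) {
  root <- basename(tools::file_path_sans_ext(pdf))
  all_png <- list.files(dirname(pdf), pattern = "\\.png$", full.names = TRUE)
  all_png[startsWith(basename(all_png), paste0(root, "-"))]
}

# Demonstration on empty placeholder files in a temporary folder
demo_dir <- tempfile("ocr_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("test-000001.png", "test-000002.png",
                                  "other-000001.png")))
pages <- find_page_images(file.path(demo_dir, "test.pdf"))
png_cmd <- build_pdftopng_cmd("D:/PDF_OCR_File/test.pdf")
```

Each path in `pages` could then be handed straight to Tesseract, so no intermediate TIFF has to be created or cleaned up.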
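Another problem with that code example is that tesseract defaults to English. The example may have been written for a different version, but there is no longer any need to specify "-l eng". See here. "out" is apparently the name of the exported txt file (purely from observation); you need to strip the path and use it in the function so that the output mimics the original file name and the OCR result is not overwritten every time it runs on a new file.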
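The output naming can be sketched like this, assuming the usual `tesseract <image> <outputbase> [-l lang]` command form, where Tesseract appends ".txt" to the output base itself. `build_tesseract_cmd` is a hypothetical helper, not part of Tesseract or R.

```r
# Hypothetical helper: build a tesseract call whose output base is derived
# from the image name, so every page image gets its own .txt file and a
# later run on another file does not overwrite it.
# lang = NULL leaves out "-l", relying on the English default noted above.
build_tesseract_cmd <- function(image, lang = NULL, exe = "tesseract") {
  out_base <- tools::file_path_sans_ext(image)
  cmd <- paste(exe, shQuote(image, type = "cmd"),
               shQuote(out_base, type = "cmd"))
  if (!is.null(lang)) {
    cmd <- paste(cmd, "-l", lang)
  }
  cmd
}

default_cmd <- build_tesseract_cmd("D:/PDF_OCR_File/test-000001.tif")
german_cmd <- build_tesseract_cmd("D:/PDF_OCR_File/test-000001.tif",
                                  lang = "deu")
```

With the first command, the text would be expected in "D:/PDF_OCR_File/test-000001.txt", next to the image it came from.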