Question

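I have started working on a project of my own in which we need to extract data from PDFs into CSV. We tried the "tm" and "pdftools" packages to get the data, but this failed because the data is either encrypted or in a local language such as Hindi, Bengali or Tamil.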

Data sources:

English PDF - http://ceodelhi.gov.in/ConstituentyDetailENG1.aspx?num=yww4Q9JSiKPyyVZ89sYMeA==&ii=e

Hindi PDF - http://ceo.bihar.gov.in/pdfsearch/draftroll.aspx

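So we decided to read the data with OCR instead, using the approach described here: https://gist.github.com/benmarwick/11333467

I followed that process but got stuck on the error below, which appears to come from ImageMagick. I could not find a suitable solution.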

Code:

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the 
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("convert *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("tesseract ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
})

Error message:

convert: unable to open image '*.ppm': Invalid argument @ error/blob.c/OpenBlob/3485

This data will be useful for the upcoming general election to be held in India in 2019.
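One thing that might be worth trying, offered as an unverified guess rather than a confirmed diagnosis: `shell()` runs the command through the Windows command interpreter, which does not expand `*.ppm` itself, so `convert` receives the literal pattern. The same error could also appear if `pdftoppm` produced no `.ppm` files in the working directory. The sketch below lists the `.ppm` files from R and passes them explicitly. The helper name `build_convert_args` and the output name `mybook.tif` are hypothetical, and the code assumes the ImageMagick `convert` executable is the one found on the PATH.

```r
# Build the argument vector for ImageMagick's `convert`, naming each .ppm
# file explicitly instead of passing the literal pattern "*.ppm".
# `build_convert_args` is a hypothetical helper, not part of any package.
build_convert_args <- function(ppm_dir, out_tif) {
  # list.files() returns the matches in alphabetical order, so page order is
  # kept as long as pdftoppm zero-pads the page numbers in the file names
  ppm_files <- list.files(ppm_dir, pattern = "\\.ppm$", full.names = TRUE)
  if (length(ppm_files) == 0) {
    stop("no .ppm files found in ", ppm_dir, "; did pdftoppm run successfully?")
  }
  # quote each path separately so that spaces in file names survive
  c(shQuote(ppm_files), shQuote(out_tif))
}

# Example call (commented out because it needs ImageMagick installed):
# system2("convert", build_convert_args(".", "mybook.tif"))
```

Stopping early when no `.ppm` files are found should at least show whether the problem lies in the `pdftoppm` step or in the `convert` step.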

Reading PDF files in R with OCR and ImageMagick - error "convert: unable to open image '*.ppm'"

0 answers: