How to clean multiple magick images

Time: 2019-10-03 02:17:47

Tags: r imagemagick tesseract

I have a PDF file with multiple pages. My goal is to convert the PDF to images, clean them up, and run OCR on the text. I can get this working well with a single image, but with multiple images I cannot map or lapply over the magick-image objects.

The attempt below gives the expected error:

multi_images <- map(multi_file_list, image_read)

image_cleaner <- function(images){
  images <- map(images, function(x){
    images %>%
      image_crop(geometry_area(width = 1290, height = 950, y_off = 285, x_off = 380)) %>%
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
  })
}

So how do I access a list of magick-image objects? I noticed that this similar question has no answers.

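1 answer:

Answer 0: (score: 1)

This approach works. Note that I changed map to base R's Map.

Also, the function body should refer to x, the current element of the loop, rather than images.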

library(magick)     # image_crop, geometry_area, image_write
library(tesseract)  # tesseract, ocr

image_cleaner <- function(images){
    Map(function(x){
        # use x (the current image) here, not images (the whole list)
        x %>%
            image_crop(geometry_area(width = 1290, height = 950, y_off = 285, x_off = 380)) %>%
            image_write(format = 'png', density = '300x300') %>%
            tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
    }, images)
}
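
To see why the variable name matters, here is a minimal base R sketch that needs no image files. The pages list is a made-up stand-in for the list of magick images, and length() stands in for the crop and OCR steps; both are assumptions for illustration only. In the buggy version every call sees the whole list, which is presumably why image_crop fails in the question.

```r
# Hypothetical stand-in for the list of magick images
pages <- list("page one", "page two", "page three")

# Buggy pattern: the body refers to `pages`, so every call
# receives the whole list instead of the current element
buggy <- Map(function(x) length(pages), pages)

# Fixed pattern: the body refers to `x`, the current element
fixed <- Map(function(x) length(x), pages)

unlist(buggy)  # each call saw all three pages
unlist(fixed)  # each call saw exactly one page
```
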

dat <- image_cleaner(multi_images)

> mapply(nchar,dat, USE.NAMES = F)
[1]  12 288 124
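
The numbers above are the character counts of the OCR text, one per page. As a rough sketch, the same check can be written with vapply, which stops with an error if any element does not produce a single integer. The dat below is invented sample text, not real OCR output, and empty_pages is a hypothetical helper name.

```r
# Hypothetical stand-in for the list returned by image_cleaner()
dat <- list("Invoice 0001", "Second page text", "")

# vapply checks that every result is exactly one integer
n_chars <- vapply(dat, nchar, integer(1))

# Pages where OCR returned no text may need a different crop area
empty_pages <- which(n_chars == 0)
```
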