Question

I have a PDF file with multiple pages. My goal is to convert the PDF to images, clean them, and extract the text with OCR. This works fine with a single image, but with multiple images I cannot map or lapply over the magick-image objects.

The following code gives an error, as expected:

multi_images <- map(multi_file_list, image_read)

image_cleaner <- function(images){
  images <- map(images, function(x){
    images %>%
      image_crop(geometry_area(width = 1290, height = 950, y_off = 285, x_off = 380)) %>%
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
  })
}

So how do I access the elements of a list of magick-image objects? I noticed that this similar question has no answers.

Answer 1

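This works. Note that I changed map to Map.

Also, inside the anonymous function you should operate on x (the current element), not on images (the whole list).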
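The difference can be seen without magick at all. Below is a minimal base R sketch in which a hypothetical list of strings (`pages`) stands in for the list of images. In the question, the equivalent mistake meant that the whole list, rather than a single magick-image, was piped into image_crop on every iteration, which is presumably why it failed.

```r
# Hypothetical stand-ins for the magick images: three short strings.
pages <- list("page one", "page two!", "p3")

# Buggy pattern: the anonymous function ignores x and touches the whole list.
buggy <- lapply(pages, function(x) length(pages))

# Fixed pattern: the anonymous function works on x, the current element.
fixed <- lapply(pages, function(x) nchar(x))

unlist(buggy)  # 3 3 3  (every call saw all three elements)
unlist(fixed)  # 8 9 2  (one result per element)
```

The same change applied to the function from the question: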

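library(magick)     # image_read, image_crop, image_write, geometry_area
library(tesseract)  # tesseract, ocr
library(magrittr)   # %>%

image_cleaner <- function(images){
    # Map() calls the function once per element of images
    Map(function(x){
        # use x (the current image) here, not images (the whole list)
        x %>%
            image_crop(geometry_area(width = 1290, height = 950, y_off = 285, x_off = 380)) %>%
            image_write(format = 'png', density = '300x300') %>%
            tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
    }, images)
}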

dat <- image_cleaner(multi_images)

> mapply(nchar,dat, USE.NAMES = F)
[1]  12 288 124
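
Since the result is a plain list with one OCR string per image, it can be collected into a table for further cleaning. The following is only a sketch: the `dat` values are hypothetical sample strings, not real OCR output, and the column names are made up for illustration. It uses vapply instead of mapply so that each result is checked to be a single integer.

```r
# Hypothetical OCR output standing in for `dat`: one string per page.
dat <- list("Invoice  001\n", "Total:   42.00\nPaid\n", "")

# Character count per page; vapply() enforces one integer per element.
chars <- vapply(dat, nchar, integer(1), USE.NAMES = FALSE)

# One row per page, with leading/trailing whitespace trimmed from the text.
pages <- data.frame(
  page    = seq_along(dat),
  text    = trimws(unlist(dat)),
  n_chars = chars,
  stringsAsFactors = FALSE
)

pages$n_chars  # 13 20 0
```

An empty string, as in the third element here, is a quick signal that the crop area may have missed the text on that page.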
