I have a PDF file with multiple pages. My goal is to convert the PDF to images, clean them, and run OCR to process the text. This works fine with a single image, but with multiple images I cannot map or lapply over the list of magick-image objects. The following attempt fails with an error:
multi_images <- map(multi_file_list, image_read)
image_cleaner <- function(images){
  images <- map(images, function(x){
    images %>%
      image_crop(geometry_area(width = 1290, height = 950, y_off = 285, x_off = 380)) %>%
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
  })
}
So how do I access a list of magick-image objects? I noticed that this similar question has no answers.
Answer 0 (score: 1)
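This approach works. Note that I changed map to Map. More importantly, the function should operate on x inside the loop, not on images: in your version every iteration pipes the whole list into image_crop rather than a single magick-image.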
library(magick)     # image_crop, geometry_area, image_write
library(tesseract)  # ocr, tesseract

image_cleaner <- function(images){
  Map(function(x){
    # Use x (the current image) here, not images (the whole list)
    x %>%
      image_crop(geometry_area(width = 1290, height = 950, y_off = 285, x_off = 380)) %>%
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
  }, images)
}
dat <- image_cleaner(multi_images)
> mapply(nchar,dat, USE.NAMES = F)
[1] 12 288 124
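To see why the original version failed, here is a minimal base R sketch of the same pattern. It assumes a plain list of strings as a stand-in for the magick-image list; the names pages, wrong, right and same are illustrative and not part of the original code.

```r
# Stand-in for the list of magick-image objects: a plain list of strings.
pages <- list("a", "bb", "ccc")

# Buggy pattern: the body ignores x and uses the whole list on every iteration.
wrong <- Map(function(x) length(pages), pages)

# Fixed pattern: the body works on x, the current element.
right <- Map(function(x) nchar(x), pages)

# lapply gives the same result here, since only one list is iterated over.
same <- lapply(pages, function(x) nchar(x))

print(unlist(wrong))  # 3 3 3: every iteration saw all three elements
print(unlist(right))  # 1 2 3: one result per element
print(identical(unname(right), same))
```

In this sketch the switch from lapply to Map makes no difference to the result; what matters is that the function body refers to x.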