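I successfully extracted text from a single PDF file using a combination of magick (in R) and tesseract, but I have hit a roadblock when trying to process multiple images (this is for a non-profit).
I welcome answers in bash, but they need to be comprehensive and must not skip the tesseract component.
The answers to this question are about cleaning images without OCR, so I am not sure how to integrate the first answer here.
My process: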
library(tesseract)
library(dplyr)
library(stringr)
library(pdftools)
library(readr)
library(magick)
library(purrr)
# original data
#pdf <- https://github.com/pembletonc/Project44_Text_Extraction/blob/master/test-data/001_0145.pdf
#image file (note that size here doesn't match processing below because of 2mb limit)
file_name <- tools::list_files_with_exts(dir = "./test-data", exts = "pdf")
page_count <- pdf_info(file_name)$pages
multi_files <- list(pdftools::pdf_convert(file_name, pages = 1:page_count,
                                          filenames = paste0("./test-data/", "page", 1:page_count, ".png"),
                                          dpi = 250))
#or just list the PNG files if they have already been created
#multi_files <- list(tools::list_files_with_exts(dir = "./test-data", exts = "png"))
To read the images in as magick objects:
multi_images <- map(multi_files, image_read)
This creates a single magick image object (printed as a tibble) in which the images are joined as frames:
[[1]]
# A tibble: 5 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>
1 PNG     3243   2010 sRGB       FALSE        0 98x98
2 PNG     3247   2013 sRGB       FALSE  4515441 98x98
3 PNG     3243   2013 sRGB       FALSE  4559229 98x98
4 PNG     3247   2010 sRGB       FALSE  4270145 98x98
5 PNG     3247   2010 sRGB       FALSE  3212528 98x98
How do I access each PNG in it so that it can be cleaned and run through OCR?
multi_text_clean <- function(images){
  Map(function(x) {
    x %>%
      image_crop(geometry_area(width = 2200, height = 1600, y_off = 500, x_off = 650)) %>%
      image_resize("2000x") %>%
      image_background("white", flatten = TRUE) %>%
      image_noise(noisetype = "Uniform") %>% # Note: this adds uniform noise; image_reducenoise() is the noise peak elimination filter
      image_enhance() %>% # Enhance image (minimize noise)
      image_normalize() %>%
      image_convert(type = 'Grayscale') %>%
      image_trim(fuzz = 40) %>%
      image_contrast(sharpen = 1) %>%
      #image_deskew(threshold = 40) %>%
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
  }, images)
}
This only runs on the first image:
text_list <- multi_text_clean(multi_images)
(text_multi <- stringr::str_split(text_list, pattern = "\\s{5,}"))
[[1]]
[1] "Weather clear all day. A small arms inspection held at 1400 hrs. A recce party went\njout consisting of Coy Comds and Lt Col Nicklin, I.0. and Asst Adjt. An Orders group\nheld in the evening. Pay parade for HQ and Bn HQ was at 1900 hrs. A movie was shown\nfor B Coy personnel by our YMCA Supervisor."
[2] ")\nWeather clear and cold all day. Personnel packed equipment early in the morning and |~\nwere ready to move at 0830 hrs. Unit embussed at 0900 hrs and moved to Rochefort, MR\n2076, Sheet 105, 1/25000, arriving at 1390 hrs. Coys were in position at 1600 hrs. |,,\nPW brought in by A Coy at 1800 hrs. PW was a deserter from 304 Regt 2 Pz division.\nNo other activity during the day. Patrols were sent out during the night by all coys}) u\nCold all day. Very quiet all morning. A Coy moved forward. Coy HQ set up at Chateawv .\n\\Vieux de Rochefort. Slight opposition met by A Coy on advance. Opposition met at\n\\Croic St Jean. A Coy was in position at 1700 hrs. Advance started at 1500 hrs. OP\nset up at 1900 hrs at MR 207753. Patrols sent out by all Coys."
[3] "“y\neather wet all day. Snowed most of the day. 1 Pl from C Coy guarding bridge MR\n204767. A Coy sent a fighting patrol to clear Powder Mill woods MR 2074. Recce\npatrols sent out byall coys."
[4] "f\nWeather fair all day. No enemy was seen during the day. A Coy sent out patrols during\ntthe day and night but no opposition memt. B Coy moved forward to MR 195771. Orders\nGroup held at 2000 hrs and orders were given to have all personnel ready to move to\nnew location by 1200 hrs on the 6 of Jan 1945. YMCA was to show a movie in the evenp\nling but the CO cancelled it. Two Polish deserters from the German army walked into\n|A Coy lines."
[5] "iz\nWeather clear all day. CO, Coy Comds, Sig Officer and Vickers Officer left to recce\nnew location at 0830 hrs. Unit started to move to new location at 1200 hrs, Unit Bs\narrived at AYE MR 2683, Sheet 91, 1\" to mile at 1500 hrs. Personnel were shown to\ntheir areas and billets."
[6] "| 9\neather clear all day. Observation Post set up by the Intelligence Sec at MR 253813.| |\nQuiet all day. No enemy activity during the day."
[7] "|\neather overcast and snowing. Intelligence Section set up another OP at MR 268814.\nNo enemy activity during the day. At 2300 hrs orders were received that all personnel\nere to be ready to move to new area on the morning of the 9th Jan, 1945."
[8] ":"
[9] "‘\nWeather clear and cold, Bm started to move at 0830 hrs. Bn reached Champlon"
[10] "&\nFamenine, MR 3182 at 1230 hrs. Bn relieved the HLI. Coys immediately took up"
[11] ":\npositions for all around defence."
[12] "4\n"
How can I iterate over each image in this magick object?
Answer 0 (score: 1)
You can do the following in ImageMagick.
Input:
convert img.jpg -negate -lat 20x20+10% -negate img_lat.jpg
Alternatively, I have a bash shell script that uses ImageMagick, called textcleaner, which will do the following:
textcleaner -f 20 -o 10 img.jpg img_textcleaner.jpg
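Since the question asks for a bash answer that does not skip the tesseract step, here is a minimal sketch of how the first command could be applied to every page image and followed by OCR. It assumes the page PNGs already exist in one directory (for example, the ones written by pdf_convert in the question) and that the convert and tesseract command-line tools are installed. The function name ocr_pages and the _lat suffix for cleaned images are illustrative choices, not part of either tool, and the threshold values are simply the ones used above, so they may need tuning for these scans.

```shell
#!/usr/bin/env bash
# Sketch: clean each PNG in a directory with ImageMagick, then OCR it with tesseract.
# Assumptions (hypothetical, for illustration): the function name ocr_pages and
# the "_lat" suffix used for the cleaned copies.

ocr_pages() {
  local dir="$1"
  local img base cleaned
  for img in "$dir"/*.png; do
    [ -e "$img" ] || continue                    # glob matched nothing
    case "$img" in *_lat.png) continue ;; esac   # skip cleaned copies left by earlier runs
    base="${img%.png}"
    cleaned="${base}_lat.png"
    # Same local adaptive threshold as the single-image command above
    convert "$img" -negate -lat 20x20+10% -negate "$cleaned" || return 1
    # tesseract appends .txt to the output base, so this writes ${base}.txt
    tesseract "$cleaned" "$base" -c preserve_interword_spaces=1 || return 1
  done
}
```

Calling ocr_pages ./test-data would then leave one cleaned image and one text file per page (page1_lat.png and page1.txt, and so on), which can be read back into R or concatenated as needed.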