Question

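To improve OCR quality, I need to preprocess scanned images. Sometimes the image I need to OCR contains several pieces (separate components on the page, sitting at different angles, for example several paper documents scanned at once), for example:

Is it possible to split such images automatically and programmatically into separate images, each containing one logical document? For example with a tool such as ImageMagick? Are there any solutions or techniques for this kind of problem?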

Answer 1

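In ImageMagick 6, you can blur the image so that the text runs together and then threshold it, so that each block of text becomes one large black region on a white background. You can then use connected components to find each separate black region, gray(0), and its bounding box. Then use the bounding box values to crop each such region out of the original image.

Input:

Unix syntax (adjust the blur so it is large enough to keep the text regions solid black):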

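infile="image.png"
# base name of the input file, without directory or extension
inname=$(convert -ping "$infile" -format "%t" info:)
OLDIFS=$IFS
IFS=$'\n'
# blur and threshold to black/white, save the mask as tmp.png,
# then list the connected components, one per array element (header line dropped)
arr=($(convert "$infile" -blur 0x5 -auto-level -threshold 99% -type bilevel +write tmp.png \
    -define connected-components:verbose=true \
    -connected-components 8 \
    null: | tail -n +2 | sed 's/^[ ]*//'))
num=${#arr[*]}
IFS=$OLDIFS
for ((i=0; i<num; i++)); do
    #echo "${arr[$i]}"
    # field 5 is the region color, field 2 is its bounding box (WxH+X+Y)
    color=$(echo ${arr[$i]} | cut -d' ' -f5)
    bbox=$(echo ${arr[$i]} | cut -d' ' -f2)
    echo "color=$color; bbox=$bbox"
    # crop only the black (text) regions from the original image, then trim the border
    if [ "$color" = "gray(0)" ]; then
        convert "$infile" -crop "$bbox" +repage -fuzz 10% -trim +repage "${inname}_$i.png"
    fi
done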

Text listing:

color=gray(255); bbox=892x1008+0+0
color=gray(0); bbox=337x430+36+13
color=gray(0); bbox=430x337+266+630
color=gray(0); bbox=202x147+506+252
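Each bbox value in this listing is a geometry string of the form WIDTHxHEIGHT+X+Y, which is why the script can pass it straight to -crop. As a rough sketch only, the fields can also be split in plain shell, for instance to skip regions too small to be a document before cropping. The helper names bbox_fields and bbox_area are hypothetical, and the code assumes non-negative offsets, as in the listing above.

```shell
# Hypothetical helper: split a geometry string WxH+X+Y into "W H X Y".
# Assumes non-negative offsets, as in the listing above.
bbox_fields() {
    # turn the separators 'x' and '+' into spaces
    echo "$1" | tr 'x+' '  '
}

# Hypothetical helper: print the bounding-box area (width * height) in pixels.
bbox_area() {
    # positional parameters become: width height x y
    set -- $(bbox_fields "$1")
    echo $(( $1 * $2 ))
}

bbox_fields "337x430+36+13"   # prints: 337 430 36 13
bbox_area "337x430+36+13"     # prints: 144910
```

Note that this is the area of the bounding box, not the pixel area of the region itself.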

tmp.png shows the blurred and thresholded regions:

Cropped images:

Answer 2

alexanoid wrote: I have added another image with scanning artifacts. Will this approach work on such images also?

No, it will not work well, for several reasons. The second image you provided is much larger than the first, so it would need a much larger blur. It is a JPG and has artifacts in it. JPG is not a good format here, since the image in 'constant' regions is not really constant. The blur will pick up your artifacts, and you will need a different threshold to remove some of them. In your case, the top of the image has a good-sized artifact that will get caught as an object. Finally, the bounding boxes of your blurred and thresholded text regions overlap even though the regions themselves do not touch. Thus one crop may include text from other regions.
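The last point can be checked from the connected-components listing before any cropping is done. The following is a minimal sketch, not part of the original approach: the function name boxes_overlap is hypothetical, and it assumes geometry strings of the form WxH+X+Y with non-negative offsets. The sample values are two of the boxes reported for the first image, which do not overlap.

```shell
# Hypothetical helper: succeed (exit status 0) if two WxH+X+Y boxes intersect.
# Assumes non-negative offsets in both geometry strings.
boxes_overlap() {
    # positional parameters become: w1 h1 x1 y1 w2 h2 x2 y2
    set -- $(echo "$1 $2" | tr 'x+' '  ')
    # two rectangles intersect only if they overlap on both the x and the y axis
    [ "$3" -lt $(( $7 + $5 )) ] && [ "$7" -lt $(( $3 + $1 )) ] &&
    [ "$4" -lt $(( $8 + $6 )) ] && [ "$8" -lt $(( $4 + $2 )) ]
}

if boxes_overlap "337x430+36+13" "430x337+266+630"; then
    echo "overlap"
else
    echo "separate"
fi
```

When two boxes do overlap, a plain rectangular crop of one region will pick up pixels from the other, so such pairs would need a different treatment than the simple -crop used in the first answer.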

Here is my test command to blur and threshold your image:

convert image.jpg -blur 0x50 -auto-level -threshold 95% -type bilevel tmp.png

Programmatically split a scanned image into separate images
