Question

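I have a bunch of dates that I am trying to OCR with tesseract. However, many of the digits in the dates merge with the lines of the date boxes, like this:

Digits intersecting boxes

For comparison, here is a good image that I can OCR without problems: Good Date Image

Here is my code: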

import os
import cv2
from matplotlib import pyplot as plt
import subprocess
import numpy as np
from PIL import Image

def show(img):
    plt.figure(figsize=(20,20))
    plt.imshow(img,cmap='gray')
    plt.show()

def sort_contours(cnts, method="left-to-right"):
    # initialize the reverse flag and sort index
    reverse = False
    i = 0

    # handle if we need to sort in reverse
    if method == "right-to-left" or method == "bottom-to-top":
        reverse = True

    # handle if we are sorting against the y-coordinate rather than
    # the x-coordinate of the bounding box
    if method == "top-to-bottom" or method == "bottom-to-top":
        i = 1

    # construct the list of bounding boxes and sort them by the chosen
    # coordinate (x by default, y for the vertical methods)
    boundingBoxes = [cv2.boundingRect(c) for c in cnts]

    cnts, boundingBoxes = zip(*sorted(zip(cnts, boundingBoxes),
        key=lambda b:b[1][i], reverse=reverse))

    # return the list of sorted contours and bounding boxes
    return cnts, boundingBoxes


def tesseract_it(contours,main_img, label,delete_last_contour=False):
    min_limit, max_limit = (1300,1700)
    idx =0 
    roi_list = []
    slist= set()
    for cnt in contours:
        idx += 1
        x,y,w,h = cv2.boundingRect(cnt)
        if label=='boxes':
            roi=main_img[y+2:y+h-2,x+2:x+w-2]
        else:
            roi=main_img[y:y+h,x:x+w]

        if w*h > min_limit and w*h < max_limit and w>10 and w< 50 and h>10 and h<50:
            if (x,y,w,h) not in slist: # Skips contours that were already seen

                roi = cv2.resize(roi,dsize=(45,45),fx=0 ,fy=0, interpolation = cv2.INTER_AREA)
                roi_list.append(roi)
                slist.add((x,y,w,h))

    if not delete_last_contour:
        vis = np.concatenate((roi_list),1)
    else:
        roi_list.pop(-1)
        vis = np.concatenate((roi_list),1)

    show(vis)

    # Tesseract the final image here
    # ...


image = 'bad_digit/1.jpg'
# image = 'bad_digit/good.jpg'
specimen_orig = cv2.imread(image,0)


specimen = cv2.fastNlMeansDenoising(specimen_orig)
#     show(specimen)
kernel = np.ones((3,3), np.uint8)

# Now we erode
specimen = cv2.erode(specimen, kernel, iterations = 1)
#     show(specimen)
_, specimen = cv2.threshold(specimen, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
#     show(specimen)
specimen_canny = cv2.Canny(specimen, 0, 0)
#     show(specimen_canny)

specimen_blank_image = np.zeros((specimen.shape[0], specimen.shape[1], 3))
# [-2:] keeps (contours, hierarchy) on both OpenCV 3 (three return values) and OpenCV 4 (two)
specimen_contours, retr = cv2.findContours(specimen_canny.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)[-2:]
# print(len(specimen_contours))
cv2.drawContours(specimen_blank_image, specimen_contours, -1, 100, 2)
#     show(specimen_blank_image)
specimen_blank_image = np.zeros((specimen.shape[0], specimen.shape[1], 3))

specimen_sorted_contours, specimen_bounding_box = sort_contours(specimen_contours)

output_string = tesseract_it(specimen_sorted_contours,specimen_orig,label='boxes',)
# return output_string

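The output for the good image attached above looks like this: Good output

Running Tesseract on this image does give me an accurate result.

However, for the images where the lines merge with the digits, my output looks like this: bad1 bad2 bad3 bad4

These do not work with Tesseract at all. I would like to know whether there is a way to remove the lines and keep only the digits.

I have also tried the following: https://docs.opencv.org/3.2.0/d1/dee/tutorial_moprh_lines_detection.html

It did not seem to work very well on the images I attached.

I also tried using imagemagick: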

convert original.jpg \
\( -clone 0 -threshold 50% -negate -statistic median 200x1 \)  \
-compose lighten -composite                                    \
\( -clone 0 -threshold 50% -negate -statistic median 1x200 \)  \
-composite output.jpg

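The results were fair, but the removed lines cut into the digits somewhat, as shown here:

imagemagick1 imagemagick2 imagemagick3 imagemagick4

Is there a better way to approach this problem? My ultimate goal is to run Tesseract on the digits, so the final image really needs to be quite clean.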

Answer 1

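Here is some code that seems to work well. There are two phases:

The digits are slightly bolder than the box lines, and the whole image is strongly horizontal. So we can apply a dilation that is stronger in the horizontal direction to get rid of most of the vertical lines.
At this point, an OCR engine (Google's, for example) can already detect most of the digits. Unfortunately, it is a bit too good and sees other things as well, so I added a second phase that is more complicated and quite specific to your context.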

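Here is the result for one of the images after the first phase:

Here are the results for all images after the second phase:

As you can see, it is not perfect: the 8 can be read as a B (well, even a human like me reads it as a B... but that is easy to fix if there are only digits in your world). There is also a ":" character (a leftover from a removed vertical line) that I could not get rid of without tweaking the code too much...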
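If the fields can only contain digits, the "B" and ":" misreads can be cleaned up after OCR. A minimal sketch, assuming the OCR result is available as a plain string: the function name `digits_only` and the look-alike table are hypothetical (only the B/8 confusion comes from this answer), so adjust the table to the confusions you actually observe.

```python
import string

# Hypothetical look-alike table; only "B" -> "8" is taken from the answer above.
LOOKALIKES = str.maketrans({"B": "8", "O": "0", "o": "0", "I": "1", "l": "1"})


def digits_only(ocr_text):
    """Map letter look-alikes to digits, then drop every non-digit character."""
    mapped = ocr_text.translate(LOOKALIKES)
    return "".join(ch for ch in mapped if ch in string.digits)


print(digits_only("0203201B:022018"))  # prints 02032018022018
```

This only works when the expected alphabet is known in advance; for mixed text the mapping would corrupt genuine letters.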

C# code:

// Requires the OpenCvSharp package (using OpenCvSharp;).
static void Unbox(string inputFilePath, string outputFilePath)
{
    using (var orig = new Mat(inputFilePath))
    {
        using (var gray = orig.CvtColor(ColorConversionCodes.BGR2GRAY))
        {
            using (var dst = orig.EmptyClone())
            {
                // this is what I call the "horizontal shake" pass.
                // note I use the Rect shape here, this is important
                using (var dilate = Cv2.GetStructuringElement(MorphShapes.Rect, new Size(4, 1)))
                {
                    Cv2.Dilate(gray, dst, dilate);
                }

                // erode just a bit to get back some numbers to life
                using (var erode = Cv2.GetStructuringElement(MorphShapes.Rect, new Size(2, 1)))
                {
                    Cv2.Erode(dst, dst, erode);
                }

                // at this point, good OCR will see most numbers
                // but we want to remove surrounding artifacts

                // find contours
                using (var canny = dst.Canny(0, 400))
                {
                    var contours = canny.FindContoursAsArray(RetrievalModes.List, ContourApproximationModes.ApproxSimple);

                    // compute a bounding rect for all numbers w/o boxes and artifacts
                    // this is the tricky part where we try to discard what's not related exclusively to numbers
                    var boundingRect = Rect.Empty;
                    foreach (var contour in contours)
                    {
                        // discard some small and broken polygons
                        var polygon = Cv2.ApproxPolyDP(contour, 4, true);
                        if (polygon.Length < 3)
                            continue;

                        // we want only numbers, and boxes are approx 40px wide,
                        // so let's discard box-related polygons, if any
                        // and some other artifacts that passed previous checks
                        // this quite depends on some context knowledge...
                        var rect = Cv2.BoundingRect(polygon);
                        if (rect.Width > 40 || rect.Height < 15)
                            continue;

                        boundingRect = boundingRect.X == 0 ? rect : boundingRect.Union(rect);
                    }

                    using (var final = dst.Clone(boundingRect))
                    {
                        final.SaveImage(outputFilePath);
                    }
                }
            }
        }
    }
}

Answer 2

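Just a suggestion; I have never tried it.

Instead of trying to remove the bars, keep them and train on all possible bar positions. Crop the bars to the character bounds so that everything is aligned properly.

For example, train on images like these labelled as 02032018022018. I guess it is best to simulate the bars on clean characters.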
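Simulating the bars on clean characters could look something like the sketch below. It is only an illustration of the idea: the function name `add_box_lines`, the line thickness, the offset range and the stand-in glyph are all assumptions, and random offsets are used here as a cheap approximation of "all possible bar positions".

```python
import numpy as np


def add_box_lines(glyph, rng, thickness=2, max_shift=4):
    """Return a copy of a dark-on-light glyph with four simulated box lines."""
    out = glyph.copy()
    h, w = out.shape
    # Each box edge sits at a random distance (0..max_shift px) from its border.
    top, bottom, left, right = (int(v) for v in rng.integers(0, max_shift + 1, size=4))
    out[top:top + thickness, :] = 0
    out[h - bottom - thickness:h - bottom, :] = 0
    out[:, left:left + thickness] = 0
    out[:, w - right - thickness:w - right] = 0
    return out


rng = np.random.default_rng(0)
clean = np.full((45, 45), 255, np.uint8)
clean[10:35, 20:25] = 0  # stand-in for a clean "1"; real glyph crops go here
samples = [add_box_lines(clean, rng) for _ in range(5)]
```

Each sample keeps the label of the clean glyph it was generated from, so a labelled training set with bars comes for free.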

Answer 3

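In particular, regarding Yves Daoust's suggestion, look at the 2018 in the image below (1): it is almost an "n" or three quarters of the digit 0, and the 8 becomes the letter B. A 2 may be read as a 6. In some cases a 0 may also be seen as a 6, and so on. Some digits may even end up "unrecognized" if you leave the grid in place. Hence, my approach would be:

Remove the redundant grid information; this helps to better recognize the digits that contain straight lines, such as 0, 1, 2, 4, 5 and 7.
Next, train a Cascade classifier on the characters.

Once the grid is removed and the training is done, the curvature of some digits can be detected easily. This turns 90-95% of the false negatives into either real digits (true positives) or bogus hits (true negatives). Then you only have to worry about the remaining 5-10%.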

Documentation and sample code can be found at OpenCV, at Code-Robin and on GitHub.

Image of the values 02032018022018:

values 02032018022018

OpenCV: digits merging into surrounding boxes
