Question

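I am trying to extract text from a few hundred JPGs that contain information on capital punishment records; the JPGs are hosted by the Texas Department of Criminal Justice (TDCJ). Below is an example snippet with personally identifiable information removed.

I've identified the underlines as the impediment to proper OCR: if I go in, screenshot a sub-snippet, and manually white out the lines, the resulting OCR through pytesseract is very good. With the underlines present, it is extremely poor.

How can I best remove these horizontal lines? What I have tried:

Starting with the OpenCV doc's walkthrough: Extract horizontal and vertical lines by using morphological operations. Got stuck pretty quickly because I know zero C++.
Following Removing Horizontal Lines in image - ended up with an undecipherable string.
Following Removing long horizontal/vertical lines from edge image using OpenCV - wasn't able to work out the intuition behind sizing the array of zeros there.

Tagging this question with c++ in the hope that someone can help translate Step 5 of the docs walkthrough to Python. I've tried a batch of transformations such as the Hough Line Transform, but I am feeling around in the dark within a library and an area I have no prior experience with.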

import cv2

# Inverted grayscale
img = cv2.imread('rsnippet.jpg', cv2.IMREAD_GRAYSCALE)
img = cv2.bitwise_not(img)

# Transform inverted grayscale to binary
th = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY, 15, -2)

# An alternative; Not sure if `th` or `th2` is optimal here
th2 = cv2.threshold(img, 170, 255, cv2.THRESH_BINARY)[1]

# Create corresponding structure element for horizontal lines.
# Start by cloning th/th2.
horiz = th.copy()
r, c = horiz.shape

# Lost after here - not understanding intuition behind sizing/partitioning

Answer 1

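So far all the answers seem to use morphological operations. Here's something a bit different. It should give fairly good results if the lines are horizontal.

For this I use a part of your sample image, shown below.

Load the image, convert it to grayscale, and invert it.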

import cv2
import numpy as np
import matplotlib.pyplot as plt

im = cv2.imread('sample.jpg')
gray = 255 - cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

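Inverted grayscale image:

If you scan a row in this inverted image, you'll see that its profile looks different depending on whether a line is present.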

plt.figure(1)
plt.plot(gray[18, :] > 16, 'g-')
plt.axis([0, gray.shape[1], 0, 1.1])
plt.figure(2)
plt.plot(gray[36, :] > 16, 'r-')
plt.axis([0, gray.shape[1], 0, 1.1])

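The green profile is a row with no underline; the red one is a row with an underline. If you take the average of each profile, you'll find that the red one has the higher average.

So, using this approach, you can detect the underlines and remove them.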

for row in range(gray.shape[0]):
    avg = np.average(gray[row, :] > 16)
    if avg > 0.9:
        cv2.line(im, (0, row), (gray.shape[1]-1, row), (0, 0, 255))
        cv2.line(gray, (0, row), (gray.shape[1]-1, row), (0, 0, 0), 1)

cv2.imshow("gray", 255 - gray)
cv2.imshow("im", im)
cv2.waitKey(0)  # keep the windows open until a key is pressed

Below are the detected underlines, drawn in red, and the cleaned image.

tesseract output of the cleaned image:

Convthed as th(
shot once in the
she stepped fr<
brother-in-lawii
collect on life in
applied for man
to the scheme i|

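The reason for using only part of the image should be clear by now. Since the personally identifiable information has been removed in the original image, the threshold wouldn't have worked. That shouldn't be a problem when you apply it to your actual images. Occasionally you may need to adjust the thresholds (16, 0.9).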
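The row test above comes down to two numbers: a pixel threshold and the fraction of a row that must exceed it. Here is a minimal sketch of that check on a synthetic image, assuming the inverted-grayscale convention used in this answer. The helper name find_underline_rows and the test image are made up for illustration.

```python
import numpy as np

def find_underline_rows(gray_inv, pixel_thresh=16, row_frac=0.9):
    """Return indices of rows where more than `row_frac` of the pixels
    exceed `pixel_thresh`. `gray_inv` is an inverted grayscale image
    (ink bright, paper dark). Hypothetical helper, not from the answer."""
    profile = gray_inv > pixel_thresh      # boolean "ink" mask
    frac = profile.mean(axis=1)            # fraction of ink in each row
    return np.flatnonzero(frac > row_frac)

# Synthetic 40x100 page: one text-like row and one underline.
page = np.zeros((40, 100), dtype=np.uint8)
page[18, ::3] = 200      # sparse strokes, roughly a third of the row
page[36, :] = 200        # underline across the whole row
page[36, 50:55] = 10     # a faint 5 px break in the underline

rows = find_underline_rows(page)
cleaned = page.copy()
cleaned[rows, :] = 0     # erase every detected row
print(rows)
```

Lowering row_frac should catch fainter or broken lines, at the risk of also erasing dense rows of text, so it is worth checking a few pages by eye after changing it.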

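The result doesn't look very good: parts of the letters are removed and some faint lines remain. I'll update if I can improve it further.

Update:

Made some improvements: clean up, then link the missing parts of the letters. I've commented the code, so I believe the process is clear. You can also check the resulting intermediate images to see how it works. The results are a bit better.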

tesseract output of the cleaned image:

Convicted as th(
shot once in the
she stepped fr<
brother-in-law. ‘
collect on life ix
applied for man
to the scheme i|

tesseract output of the cleaned image:

)r-hire of 29-year-old .
revolver in the garage ‘
red that the victim‘s h
{2000 to kill her. mum
250.000. Before the kil
If$| 50.000 each on bin
to police.

Python code:

import cv2
import numpy as np
import matplotlib.pyplot as plt

im = cv2.imread('sample2.jpg')
gray = 255 - cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
# prepare a mask using Otsu threshold, then copy from original. this removes some noise
__, bw = cv2.threshold(cv2.dilate(gray, None), 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
gray = cv2.bitwise_and(gray, bw)
# make copy of the low-noise underlined image
grayu = gray.copy()
imcpy = im.copy()
# scan each row and remove lines
for row in range(gray.shape[0]):
    avg = np.average(gray[row, :] > 16)
    if avg > 0.9:
        cv2.line(im, (0, row), (gray.shape[1]-1, row), (0, 0, 255))
        cv2.line(gray, (0, row), (gray.shape[1]-1, row), (0, 0, 0), 1)

cont = gray.copy()
graycpy = gray.copy()
# after contour processing, the residual will contain small contours
residual = gray.copy()
# find contours
contours, hierarchy = cv2.findContours(cont, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
for i in range(len(contours)):
    # find the boundingbox of the contour
    x, y, w, h = cv2.boundingRect(contours[i])
    if 10 < h:
        cv2.drawContours(im, contours, i, (0, 255, 0), -1)
        # if boundingbox height is higher than threshold, remove the contour from residual image
        cv2.drawContours(residual, contours, i, (0, 0, 0), -1)
    else:
        cv2.drawContours(im, contours, i, (255, 0, 0), -1)
        # if boundingbox height is less than or equal to threshold, remove the contour from the gray image
        cv2.drawContours(gray, contours, i, (0, 0, 0), -1)

# now the residual only contains small contours. open it to remove thin lines
st = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
residual = cv2.morphologyEx(residual, cv2.MORPH_OPEN, st, iterations=1)
# prepare a mask for residual components
__, residual = cv2.threshold(residual, 0, 255, cv2.THRESH_BINARY)

cv2.imshow("gray", gray)
cv2.imshow("residual", residual)   

# combine the residuals. we still need to link the residuals
combined = cv2.bitwise_or(cv2.bitwise_and(graycpy, residual), gray)
# link the residuals
st = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (1, 7))
linked = cv2.morphologyEx(combined, cv2.MORPH_CLOSE, st, iterations=1)
cv2.imshow("linked", linked)
# prepare a mask from linked image
__, mask = cv2.threshold(linked, 0, 255, cv2.THRESH_BINARY)
# copy region from low-noise underlined image
clean = 255 - cv2.bitwise_and(grayu, mask)
cv2.imshow("clean", clean)
cv2.imshow("im", im)
cv2.waitKey(0)  # keep the windows open until a key is pressed

Answer 2

You can try this.

import cv2
import numpy as np
import matplotlib.pyplot as plt

img = cv2.imread('img_provided_by_op.jpg', 0)
img = cv2.bitwise_not(img)

# (1) clean up noises
kernel_clean = np.ones((2,2),np.uint8)
cleaned = cv2.erode(img, kernel_clean, iterations=1)

# (2) Extract lines
kernel_line = np.ones((1, 5), np.uint8)  
clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)
clean_lines = cv2.dilate(clean_lines, kernel_line, iterations=6)

# (3) Subtract lines
cleaned_img_without_lines = cleaned - clean_lines
cleaned_img_without_lines = cv2.bitwise_not(cleaned_img_without_lines)

plt.imshow(cleaned_img_without_lines)
plt.show()
cv2.imwrite('img_wanted.jpg', cleaned_img_without_lines)

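Demonstration

This method is based on Zaw Lin's answer. He/she identified the lines in the image and then subtracted them to get rid of them. However, we cannot just subtract the lines here, because the letters e, t, E, T, and - contain lines as well! If we simply subtracted the horizontal lines from the image, e would become nearly identical to c, and - would disappear...

Q: How do we find the lines?

To find the lines, we can use the erode function. To use erode, we need to define a kernel. (You can think of a kernel as the window/shape that the function operates on.)

The kernel slides through the image (as in 2D convolution). A pixel in the original image (either 1 or 0) is considered 1 only if all the pixels under the kernel are 1; otherwise it is eroded (made zero). - (Source).

To extract lines, we define a kernel, kernel_line, as np.ones((1, 5)), i.e. [1, 1, 1, 1, 1]. This kernel slides through the image and erodes every pixel that has a 0 under the kernel.

More specifically, when the kernel is applied to a pixel, it captures the two pixels to its left and the two pixels to its right.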

 [X X Y X X]
      ^
      |
Applied to Y, `kernel_line` captures Y's neighbors. If any of them is
0, Y will be set to 0.
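To make the rule in the diagram concrete, here is a small NumPy-only sketch of erosion along a single row. It is not how OpenCV implements erode; the helper erode_row is hypothetical, and treating pixels beyond the row ends as 0 is an assumption of this sketch.

```python
import numpy as np

def erode_row(row, width=5):
    """Binary erosion of a 1-D row with a flat kernel of odd `width`.

    A pixel stays 1 only if every pixel under the kernel is 1.
    Pixels beyond the row ends are treated as 0 (assumption of this
    sketch; OpenCV's default border handling differs).
    """
    half = width // 2
    padded = np.pad(row, half, constant_values=0)
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return windows.min(axis=1)

# A run of three 1s (a short stroke) and a run of nine 1s (a line).
row = np.array([0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], dtype=np.uint8)
print(erode_row(row))
```

The short run vanishes because no 5-pixel window fits inside it, while the long run survives with two pixels trimmed from each end.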

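Horizontal lines are preserved under this kernel, while pixels that have no horizontal neighbors disappear. That is how we capture the lines with the following line of code.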

clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)

Q: How do we avoid extracting the lines inside e, E, t, T, and -?

We combine erosion and dilation with the iterations parameter.

clean_lines = cv2.erode(cleaned, kernel_line, iterations=6)

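You've probably noticed the iterations=6 part. The effect of this parameter is to make the flat parts of e, E, t, T, and - disappear. This is because when we apply the same operation multiple times, the ends of these lines keep shrinking. (With the same kernel applied, only the boundary pixels meet a 0 and become 0 as a result.) Each pass of the 1x5 kernel trims two pixels from each end of a run, so six passes erase any horizontal run shorter than 25 pixels. We use this trick to make the lines in these characters disappear.

However, this has a side effect: the long underlines that we want to get rid of shrink as well. We can grow them back with dilate!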

clean_lines = cv2.dilate(clean_lines, kernel_line, iterations=6)

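In contrast to erosion, which shrinks the image, dilation makes it larger. We still use the same kernel, kernel_line, but now the target pixel becomes 1 if any pixel under the kernel is 1. Applying this regrows the boundaries. (The parts in e, E, t, T, and - do not grow back, provided we pick the parameters carefully so that they vanish completely during the erosion step.)

With this additional trick, we can get rid of the lines without hurting e, E, t, T, and -.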

Answer 3

A few suggestions:

Given that you're starting from a JPEG, don't compound the loss. Save your intermediate files as PNGs; Tesseract copes with those just fine.
Scale the image 2x before handing it to Tesseract.
Try detecting and removing the black underlines. (This question might help.) Doing that while preserving descenders might get tricky.
Explore Tesseract's command-line options, of which there are many (and they are poorly documented; some require diving into the C++ source to try to understand them). It looks like ligatures are causing some grief. IIRC (it's been a while), there is a setting or two that might help.

Answer 4

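Since most of the lines to detect in your source image are long horizontal lines, this is similar to another answer of mine, Find single color, horizontal spaces in image.

This is the source image:

Here are the two main steps to remove the long horizontal lines:

Do a morphological close with a long line kernel on the gray image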

kernel = np.ones((1,40), np.uint8)
morphed = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)

Then, get the morphed image, which contains the long lines:

Invert the morphed image and add it to the source image:

dst = cv2.add(gray, (255-morphed))

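Then get the image with the long lines removed:

Simple enough, right? There are still some small line segments left, which I think have little effect on OCR. Notice that almost all characters keep their original shape, except g, j, p, q, y, and Q, which may look a little different. But modern OCR tools such as Tesseract (with LSTM technology) are able to deal with such simple confusion.

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

The full code, which saves the line-removed image as line_removed.png: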

#!/usr/bin/python3
# 2018.01.21 16:33:42 CST
import cv2
import numpy as np

## Read
img = cv2.imread("img04.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

## (1) Create long line kernel, and do morph-close-op
kernel = np.ones((1,40), np.uint8)
morphed = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)
cv2.imwrite("line_detected.png", morphed)

## (2) Invert the morphed image, and add to the source image:
dst = cv2.add(gray, (255-morphed))
cv2.imwrite("line_removed.png", dst)

Update @ 2018.01.23 13:15:15 CST:

Tesseract is a powerful OCR tool. Today I installed tesseract-4.0 and pytesseract. Then I ran OCR with pytesseract on my result, line_removed.png.

import cv2
import pytesseract

img = cv2.imread("line_removed.png")
print(pytesseract.image_to_string(img, lang="eng"))

This is the result, which looks fine to me.

Convicted as the triggerman in the murder—for—hire of 29—year—old . shot once in the head with a 357 Magnum revolver in the garage of her home at .. she stepped from her car. Police discovered that the victim‘s husband, brother—in—law, _ ______ paid _ $2,000 to kill her, apparently so .. _ collect on life insurance policies totaling $250,000. Before the killing, . applied for additional life insurance policies of $150,000 each on himself and his wife to the scheme in three different statements to police. was and could had also . confessed

Removing horizontal underlines
