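I am using the Google Vision API to extract text from images of application forms (handwritten plus computer-printed). The response is one long string; a sample is reproduced further down.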
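The response as a whole is of no use to me; I need to parse it to get specific fields such as Name, Father's Name, NIC No., Gender, Age, DoB, Domicile and Contact No.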
I am using Python's regular expression library (re) to define a pattern for each field, run against a response string such as this one:
"A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No
✓
Religion: Muslim
✓
Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"
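However, the patterns I have written are not robust, and I don't know whether this approach is a good one. I also can't extract fields that sit on the same line, for example Gender and Age.
How can I solve this?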
Answer 0 (score: 1)
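It may not be robust, but it is possible to design an expression that extracts the three parameters you want (Name, Father's Name and Age). An online regex tester can help you build it. You might want an expression with several boundaries: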
(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))
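It can help to focus on the text you want to extract, using a simple character class such as:
[A-Z-a-z\s\.]
However, you can change or simplify it as you wish. A regex visualizer can help you picture how the expression is built. A test in Python: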
# -*- coding: UTF-8 -*-
import re
string = """
A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No
✓
Religion: Muslim
✓
Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"""
expression = r'(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))'
match = re.search(expression, string)
if match:
    # group(2) is the "Name:" alternative; strip() drops the trailing newline it captures
    print("YAAAY! \"" + match.group(2).strip() + "\" is a match ")
else:
    print(' Sorry! No matches!')
Output:
YAAAY! "Name: MUHAMMAD HANIE" is a match
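The question also asks about fields that share a line, such as Gender and Age. One possible approach, sketched here rather than offered as a definitive solution, is to write a separate pattern per field and anchor each on its printed label instead of on line boundaries. The names `FIELD_PATTERNS` and `extract_fields` are hypothetical, the sample text is a shortened copy of the string from the question, and the patterns assume the labels come back from the OCR spelled exactly as on this form.

```python
import re

# Shortened copy of the OCR text from the question (same wording and line breaks).
text = """B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
"""

# Hypothetical field table: one pattern per field, value in capture group 1.
# [^\S\n] means "whitespace except newline", so a value never spills onto the next line.
FIELD_PATTERNS = {
    "name": r"^Name:[^\S\n]*(.+)$",
    "father_name": r"^Father's Name:[^\S\n]*(.+)$",
    "nic": r"NIC No\.[^\S\n]*(\d[\d -]*\d)",
    "gender": r"Gender:[^\S\n]*(\w+)",
    "age": r"Age:[^\S\n]*\(in years\)[^\S\n]*(\d+)",
    "dob": r"Date of Birth[^\d\n]*([\d\-/]+)",
    "domicile": r"Domicile \(District\):[^\S\n]*(.+?)[^\S\n]+Contact No\.",
    "contact": r"Contact No\.[^\S\n]*([\d\-]+)",
}


def extract_fields(ocr_text):
    """Return {field: value or None} for every pattern in FIELD_PATTERNS."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        # MULTILINE lets ^ and $ match at each line, not only at the ends of the text.
        match = re.search(pattern, ocr_text, flags=re.MULTILINE)
        result[field] = match.group(1).strip() if match else None
    return result


fields = extract_fields(text)
for field, value in fields.items():
    print(f"{field}: {value}")
```

Because each pattern searches for its own label, Gender, Age and Date of Birth are found independently even though they sit on one line. A field whose label the OCR has mangled simply comes back as `None`, which makes failures easy to spot and handle case by case.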