Regular expression to extract specific variables and values

Time: 2019-05-11 21:35:25

Tags: python regex regex-lookarounds regex-group regex-greedy

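I am using the Google Vision API to extract text from an image of an application form (handwritten plus computer-printed). The response is a long string, as shown below.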

String:

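The whole response is of no use to me, but I need to parse it to get specific fields such as Name, Father's Name, NIC No., Gender, Age, Date of Birth, Domicile, and Contact No.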

I am using Python's regular expression library (re) to define a pattern for each field. For example:

Output:

"A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No
✓
Religion: Muslim
✓
Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"

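However, none of these patterns is reliable, and I do not know whether this approach is a good one. I also cannot extract fields that share a line, such as Gender and Age.

How can I solve this problem?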

1 answer:

Answer 0: (score: 1)

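It may not be robust, but it is possible to design an expression that extracts the three desired parameters. This tool can help you do so. You might want an expression with several boundaries: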

(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))

It is probably best to focus only on the text you wish to extract.
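
As a rough illustration of focusing on individual values, the sketch below uses one small pattern per field, which also handles fields that share a line such as Gender and Age. The field names, the patterns, and the shortened sample text are assumptions made for this example; real OCR output will likely need looser patterns.

```python
import re

# Shortened excerpt of the OCR response from the question (assumed sample).
text = """Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
"""

# One pattern per field; group 1 captures only the value.
# These names and patterns are illustrative assumptions, not a fixed schema.
FIELD_PATTERNS = {
    "gender": r"Gender:\s*([A-Za-z]+)",
    "age": r"Age:\s*\(in years\)\s*(\d+)",
    "date_of_birth": r"Date of Birth\D*(\d{1,2}-\d{1,2}-\d{4})",
    "domicile": r"Domicile \(District\):\s*([A-Za-z]+)",
    "contact_no": r"Contact No\.\s*([\d-]+)",
}


def extract_fields(text):
    """Return a dict of field name -> captured value (None if not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = re.search(pattern, text)
        result[name] = m.group(1) if m else None
    return result


fields = extract_fields(text)
for name, value in fields.items():
    print(name, "=", value)
```

Because each pattern starts at its own label and captures only what follows, it does not matter that Gender, Age, and Date of Birth sit on the same line.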

Variations

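  • Age: this variable seems fairly easy to extract.
  • Name and Father's Name: you may want to check what kinds of values appear in these two variables and add them to the character list. I have simply assumed that the character list might be [A-Z-a-z\s\.]. You can change or simplify it as needed.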
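
To illustrate why the contents of the character list matter, here is a small sketch comparing a list that contains \s with one that contains a literal space. The two-line sample and both patterns are assumptions made for the comparison, not the expression above.

```python
import re

# Two lines taken from the form text in the question.
text = "Name: MUHAMMAD HANIE\nFather's Name: MUHAMMAD Y AQOOB\n"

# \s also matches "\n", so this list keeps going onto the next line
# until it meets a character outside the list (here the apostrophe).
greedy = re.search(r"Name:([A-Za-z\s.]+)", text).group(1)

# A literal space instead of \s keeps the value on a single line.
single_line = re.search(r"Name:([A-Za-z .]+)", text).group(1)

print(repr(greedy))       # ' MUHAMMAD HANIE\nFather'
print(repr(single_line))  # ' MUHAMMAD HANIE'
```

With \s in the list, the pattern depends on backtracking to a trailing \n to stop at the right place; with a literal space, the list itself cannot cross a line break.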

RegEx descriptive graph

This link helps you visualize the expression.

Python test

# -*- coding: UTF-8 -*-
import re

string = """
A. Bank Challan
Bank Branch
ca
ABC muitce
Deposit ID VOSSÁETM-0055
Deposit Date 16 al 19
ate
B. Personal Information: Use CAPITAL letters and leave spaces between words.
Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Computerized NIC No. 44 603-5 284 355-3
D D M m rrrr
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
Domicile (District): Mirpuskhas Contact No. 0333-7078758
(Please do not mention converted No.)
Postal Address: Raheel Book Depo Naukot Taluka jhuddo Disstri mes.
Sindh.
Are You Government Servant: Yes
(If yes, please attach NOC)
No
✓
Religion: Muslim
✓
Non-Muslimo
C. Academic Information:
B
Intermediate/HSSC ENG Mirpuskhas Bise Match
Seience BISEmirpuskhas Match
2016
2014
Matric/SSC"""
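# The expression from above: three alternatives behind a lookahead for a capital letter.
expression = r'(?=[A-Z])((Name:[A-Z-a-z\s]+\n|\s)|(Father\x27s\sName[A-Z-a-z\s\.]+\n|\s)|(Age:\s\(in\syears\)\s[0-9]+))'
# re.search returns only the first match, which here is the "Name:" line.
match = re.search(expression, string)
if match:
    # group(2) ends with the matched "\n", so strip it before printing.
    print("YAAAY! \"" + match.group(2).strip() + "\" is a match  ")
else:
    print(' Sorry! No matches!')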

Output

YAAAY! "Name: MUHAMMAD HANIE" is a match
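
Since re.search stops at the first match, the test reports only the name. The sketch below collects all three values with re.finditer and named groups. It uses an adapted expression rather than the one in the answer: the group names, the line anchors, the colon after "Father's Name", and the literal spaces in the character lists are all assumptions chosen to fit this sample.

```python
import re

# Shortened excerpt of the form text (assumed sample).
text = """Name: MUHAMMAD HANIE
Father's Name: MUHAMMAD Y AQOOB
Gender: Male Age: (in years) 22 Date of Birth ( 4-08-1999
"""

# Adapted expression (an assumption, not the answer's original):
# ^ with re.MULTILINE keeps "Name:" from also matching inside "Father's Name:".
expression = re.compile(
    r"^Name:\s*(?P<name>[A-Za-z .]+)$"
    r"|^Father's Name:\s*(?P<father_name>[A-Za-z .]+)$"
    r"|Age:\s*\(in years\)\s*(?P<age>[0-9]+)",
    re.MULTILINE,
)

found = {}
for match in expression.finditer(text):
    # lastgroup is the name of the named group that took part in this match.
    found[match.lastgroup] = match.group(match.lastgroup).strip()

print(found)
```

Each iteration matches exactly one alternative, so the dictionary ends up with one entry per field that was present in the text.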