Unable to extract date of birth from a given format

Time: 2018-08-17 01:06:17

Tags: python regex python-3.x data-extraction

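I have a set of text files from which I need to extract the date of birth (DOB). The code below extracts it from most of the files, but fails when the data comes in the following format. How can I extract the DOB? The data is very inconsistent.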

Data:

data="""
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

Code:

import re
pattern = re.compile(r'.*DOB.*((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4})).*', re.I)

matches=pattern.findall(data)

for match in matches:
    print(match)

Expected output:

12/23/1955

2 Answers:

Answer 0 (score: 2)

import re
string = "DOB/Sex:    12/23/1955            11/15/2014   11:53 AM"
print(re.findall(r'.*?DOB.*?:\s+([\d/]+)', string))

Output:

['12/23/1955']
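
One caveat worth noting: this pattern requires at least one whitespace character after the colon (`\s+`) and is case-sensitive on the literal `DOB`. A minimal sketch, reusing the question's sample data, of how it behaves on all three lines and what changes if `\s+` is relaxed to `\s*` (the relaxed variant is an illustrative assumption, not part of the original answer):

```python
import re

data = """
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

# Pattern exactly as given in the answer: needs whitespace after the colon.
strict = re.findall(r'.*?DOB.*?:\s+([\d/]+)', data)

# Hypothetical relaxed variant: whitespace after the colon is optional.
relaxed = re.findall(r'.*?DOB.*?:\s*([\d/]+)', data)

print(strict)   # ['12/23/1955']
print(relaxed)  # ['12/23/1955', '10/30/1970']
```

Neither variant picks up the "Date of birth is ..." line, since that line contains neither `DOB` nor a colon.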

Answer 1 (score: 1)

import re    

data="""
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

pattern = re.compile(r'.*?\b(?:DOB|Date of birth)\b.*?(\d{1,2}[/-]\d{1,2}[/-](?:\d\d){1,2})',re.I)

matches=pattern.findall(data)

for match in matches:
    print(match)    

Output:

12/23/1955
9/15/1963
10/30/1970
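
A likely reason the pattern in the question misses the first line is its greedy `.*` after `DOB`: it runs to the end of the line and backtracks, so the capture group lands on the rightmost text that still looks like a date. The small sketch below illustrates the difference on a shortened version of the first line; the variable names and the simplified date sub-pattern are only for illustration:

```python
import re

line = "Thomas, John - DOB/Sex:    12/23/1955      11/15/2014   11:53 AM"
date = r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})'

# Greedy: .* swallows as much as possible, then backtracks from the right.
greedy = re.search(r'DOB.*' + date, line, re.I)

# Lazy: .*? stops at the first position where a date can match.
lazy = re.search(r'DOB.*?' + date, line, re.I)

print(greedy.group(1))  # 1/15/2014
print(lazy.group(1))    # 12/23/1955
```

The greedy version even drops the leading `1` of `11/15/2014`, because `\d{1,2}` is satisfied by a single digit at the rightmost viable starting point.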

Explanation:

.*?             : 0 or more of any character except newline (lazy)
\b              : word boundary
(?:             : start non-capturing group
  DOB           : literally
 |              : OR
  Date of birth : literally
)               : end group
\b              : word boundary
.*?             : 0 or more of any character except newline (lazy)
(               : start capture group 1
    \d{1,2}     : 1 or 2 digits
    [/-]        : slash or dash
    \d{1,2}     : 1 or 2 digits
    [/-]        : slash or dash
    (?:         : start non-capturing group
        \d\d    : 2 digits
    ){1,2}      : end group; may appear once or twice (i.e., 2 or 4 digits)
)               : end capture group 1
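
The breakdown above can also be kept inside the pattern itself using `re.VERBOSE`. The following is a sketch under a few assumptions: the dates are in US month/day/year order (as the sample `12/23/1955` suggests), the leading `.*?` is dropped because `findall` scans the text anyway, and the names `DOB_PATTERN` and `extract_dobs` are hypothetical helpers, not part of the original answer. It additionally passes each match through `datetime.strptime` so that impossible dates such as `13/45/1999` are skipped.

```python
import re
from datetime import datetime

# Same idea as the pattern above, with the explanation embedded as comments.
DOB_PATTERN = re.compile(r"""
    \b(?:DOB|Date\ of\ birth)\b   # keyword (spaces must be escaped in VERBOSE mode)
    .*?                           # lazily skip to the first date on the line
    (                             # group 1: the date
        \d{1,2} [/-]              # 1 or 2 digits, then slash or dash
        \d{1,2} [/-]              # 1 or 2 digits, then slash or dash
        (?:\d\d){1,2}             # 2 or 4 digits
    )
""", re.I | re.VERBOSE)


def extract_dobs(text):
    """Return a list of date objects for each DOB found, skipping invalid dates."""
    dates = []
    for raw in DOB_PATTERN.findall(text):
        raw = raw.replace("-", "/")
        # Assumption: month/day/year order; year is either 2 or 4 digits.
        fmt = "%m/%d/%Y" if len(raw.split("/")[2]) == 4 else "%m/%d/%y"
        try:
            dates.append(datetime.strptime(raw, fmt).date())
        except ValueError:
            continue  # looked like a date but is not a real calendar date
    return dates


data = """
Thomas, John - DOB/Sex:    12/23/1955                                     11/15/2014   11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

for d in extract_dobs(data):
    print(d.isoformat())
# 1955-12-23
# 1963-09-15
# 1970-10-30
```

Returning `date` objects instead of raw strings makes later steps such as sorting or age calculation simpler, at the cost of committing to one date order up front.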