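I have a set of text files from which I need to extract the date of birth. The code below extracts the date of birth from most of the files, but it fails when the data is given in the following format. How can I extract the DOB? The data is very inconsistent.
Data: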
data="""
Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""
Code:
import re
pattern = re.compile(r'.*DOB.*((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4})).*', re.I)
matches=pattern.findall(data)
for match in matches:
    print(match)
Expected output:
12/23/1955
Answer 0 (score: 2)
import re
string = "DOB/Sex: 12/23/1955 11/15/2014 11:53 AM"
print(re.findall(r'.*?DOB.*?:\s+([\d/]+)', string))
Output:
['12/23/1955']
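A quick check against the three sample lines from the question suggests this pattern only covers the first one. It needs the literal, case-sensitive text DOB, then a colon, then at least one whitespace character before the date. The names `lines`, `narrow` and `results` below are only illustrative.

```python
import re

# The three sample lines from the question.
lines = [
    'Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM"',
    "Jacob's Date of birth is 9/15/1963",
    "Name:Annie; DOB:10/30/1970",
]

# Same pattern as above: "DOB", a colon, whitespace, then digits and slashes.
narrow = re.compile(r'.*?DOB.*?:\s+([\d/]+)')

# Map each line to whatever the pattern finds in it.
results = {line: narrow.findall(line) for line in lines}

for line, found in results.items():
    print(found, "<-", line)
```

Line 2 never says "DOB", and line 3 has no space after the colon, so both come back empty.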
Answer 1 (score: 1)
import re
data="""
Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""
pattern = re.compile(r'.*?\b(?:DOB|Date of birth)\b.*?(\d{1,2}[/-]\d{1,2}[/-](?:\d\d){1,2})',re.I)
matches=pattern.findall(data)
for match in matches:
    print(match)
Output:
12/23/1955
9/15/1963
10/30/1970
Explanation:
.*? : 0 or more of any character except newline, as few as possible (lazy)
\b : word boundary
(?: : start of non-capturing group
DOB : the literal text "DOB"
| : OR
Date of birth : the literal text "Date of birth"
) : end of group
\b : word boundary
.*? : 0 or more of any character except newline, as few as possible (lazy)
( : start of capture group 1
\d{1,2} : 1 or 2 digits
[/-] : slash or dash
\d{1,2} : 1 or 2 digits
[/-] : slash or dash
(?: : start of non-capturing group
\d\d : 2 digits
){1,2} : end of group; the group may appear once or twice (i.e. 2 or 4 digits)
) : end of capture group 1
re.I : flag that makes the match case-insensitive
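The breakdown above can also be written into the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. This is only a sketch. `DOB_PATTERN`, `extract_dobs` and `parse_dob` are hypothetical names, and the month/day/year order is an assumption based on sample values such as 12/23/1955. Because verbose mode ignores literal spaces, "Date of birth" is spelled with `\s+` between the words.

```python
import re
from datetime import datetime

DOB_PATTERN = re.compile(
    r"""
    \b(?:DOB|Date\s+of\s+birth)\b   # keyword (case-insensitive via the flag)
    .*?                             # lazily skip to the first date-like token
    (                               # group 1: the date
        \d{1,2} [/-]                # 1 or 2 digits, then slash or dash
        \d{1,2} [/-]                # 1 or 2 digits, then slash or dash
        (?:\d\d){1,2}               # year: 2 or 4 digits
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_dobs(text):
    """Return the first date that follows a DOB keyword on each line."""
    return [m.group(1) for m in DOB_PATTERN.finditer(text)]


def parse_dob(raw):
    """Parse a captured date, ASSUMING month/day/year order.

    Returns a datetime.date, or None if the digits are not a valid date.
    """
    normalized = raw.replace("-", "/")
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):  # try 4-digit year first
        try:
            return datetime.strptime(normalized, fmt).date()
        except ValueError:
            continue
    return None


sample = """
Thomas, John - DOB/Sex: 12/23/1955 11/15/2014 11:53 AM"
Jacob's Date of birth is 9/15/1963
Name:Annie; DOB:10/30/1970
"""

for raw in extract_dobs(sample):
    print(raw, "->", parse_dob(raw))
```

Running the captured text through `datetime.strptime` is optional, but it rejects strings that merely look like dates (for example 13/45/1999), which helps when the input is as inconsistent as described.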