我有一个文本文件,我想从中提取所有日期但不知何故我的代码也提取其他值,如
程序#:10075453。
以下是该文件的一小部分示例:
Patient Name: Mills, John Procedure #: 10075453
October 7, 2017
Med Rec #: 747901 Visit ID: 110408731
Patient Location: OUTPATIENT Patient Type: OUTPATIENT
DOB:07/09/1943 Gender: F Age: 73Y Phone: (321)8344-0456
我可以了解如何解决这个问题吗?
doc = []
with open('Clean.txt', encoding="utf8") as file:
for line in file:
doc.append(line)
df = pd.Series(doc)
def date_extract():
one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})')
two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))')
three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})')
dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber','December',regex=True).replace('Janaury','January',regex=True))
return pd.Series(dates.sort_values())