Extract only dates from a text file and ignore large numbers

Time: 2018-05-25 02:40:45

Tags: regex python-3.x pandas

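I have a text file from which I want to extract all the dates, but somehow my code also picks up other values, such as the number in "Procedure #: 10075453".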

Here is a small sample of the file:

Patient Name:  Mills, John      Procedure #:  10075453
October 7, 2017
Med Rec #:  747901                  Visit ID:  110408731
Patient Location:  OUTPATIENT               Patient Type:  OUTPATIENT
DOB:07/09/1943      Gender:  F  Age: 73Y    Phone:  (321)8344-0456

Could I get some insight into how to fix this? Here is my code:

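import pandas as pd

# Read the file into a Series with one line of text per row.
doc = []
with open('Clean.txt', encoding="utf8") as file:
    for line in file:
        doc.append(line)

df = pd.Series(doc)

def date_extract():
    # Month-name dates, e.g. "October 7, 2017". expand=False returns a Series.
    one = df.str.extract(r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})', expand=False)

    # Numeric dates, e.g. "07/09/1943".
    two = df.str.extract(r'((?:\d{1,2})(?:(?:\/|-)\d{1,2})(?:(?:\/|-)\d{2,4}))', expand=False)

    # Month/year or a bare four-digit year, e.g. "6/2008" or "2008".
    three = df.str.extract(r'((?:\d{1,2}(?:-|\/))?\d{4})', expand=False)

    # Prefer the most specific pattern, then correct two misspelled month names.
    dates = pd.to_datetime(one.fillna(two).fillna(three).replace('Decemeber','December',regex=True).replace('Janaury','January',regex=True))
    return dates.sort_values()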
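
A likely cause is that the third pattern matches any four consecutive digits, so it also matches the first four digits of a longer number such as 10075453. One possible fix, shown here only as a minimal sketch, is to wrap the patterns in lookarounds so that a match cannot start or end next to another digit, slash, or hyphen. The names MONTH, DATE_RE, and find_dates are made up for this illustration, and the sketch uses the standard re module rather than pandas so it can be run on its own. It assumes that dates never touch another digit, "/" or "-" (so a year range like "2017-2018" would be skipped).

```python
import re

# Month names, abbreviated or spelled out (case-sensitive, as in the question).
MONTH = r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"

# Hypothetical combined pattern. The lookbehind and lookahead reject any
# candidate that is glued to another digit, "/" or "-", which is what keeps
# long IDs and phone numbers from contributing a four-digit "year".
DATE_RE = re.compile(
    r"(?<![\d/-])(?:"
    + r"(?:\d{1,2}\s)?" + MONTH + r"[\s.,-]+(?:\d{1,2}[a-z]*[\s,-]+)?\d{2,4}"
    + r"|\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"   # e.g. 07/09/1943
    + r"|(?:\d{1,2}[/-])?\d{4}"           # e.g. 6/2008 or a bare year
    + r")(?![\d/-])"
)

def find_dates(text):
    """Return every date-like substring of text, in order of appearance."""
    return [m.group(0) for m in DATE_RE.finditer(text)]

sample = [
    "Patient Name:  Mills, John      Procedure #:  10075453",
    "October 7, 2017",
    "Med Rec #:  747901                  Visit ID:  110408731",
    "DOB:07/09/1943      Gender:  F  Age: 73Y    Phone:  (321)8344-0456",
]

for line in sample:
    print(find_dates(line))
```

On the sample lines, only "October 7, 2017" and "07/09/1943" should be returned; the procedure number, record number, visit ID, and phone number are skipped. The same lookaround idea could be applied to the patterns passed to df.str.extract, but that variant is untested here.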

0 answers:

No answers yet