Extracting and parsing dates in a pandas DataFrame

Time: 2017-09-04 21:43:54

Tags: python pandas date dataframe data-cleaning

I'm trying to convert a messy notebook with dates into a sorted date series in pandas.

0           03/25/93 Total time of visit (in minutes):\n
1                         6/18/85 Primary Care Doctor:\n
2      sshe plans to move as of 7/8/71 In-Home Servic...
3                  7 on 9/27/75 Audit C Score Current:\n
4      2/6/96 sleep studyPain Treatment Pain Level (N...
5                      .Per 7/06/79 Movement D/O note:\n
6      4, 5/18/78 Patient's thoughts about current su...
7      10/24/89 CPT Code: 90801 - Psychiatric Diagnos...
8                           3/7/86 SOS-10 Total Score:\n
9               (4/10/71)Score-1Audit C Score Current:\n
10     (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...
11                         4/09/75 SOS-10 Total Score:\n
12     8/01/98 Communication with referring physician...
13     1/26/72 Communication with referring physician...
14     5/24/1990 CPT Code: 90792: With medical servic...
15     1/25/2011 CPT Code: 90792: With medical servic...

I have multiple date formats, such as 04/20/2009; 04/20/09; 4/20/09; 4/3/09. I'd like to convert all of these to mm/dd/yyyy in a new column.

So far, I have done

df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})')

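Also, I don't know how to extract all the rows that only have dates in mm/yy or yyyy format without interfering with the code above. Bear in mind that when the day or the month is missing, I would use the 1st and January as the defaults.

1 answer:

Answer 0 (score: 1)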

You can use pd.Series.str.extract with a regular expression, and then apply pd.to_datetime:

df['Date'] = df.Text.str.extract(r'(?P<Date>\d+(?:\/\d+){2})', expand=False)\
               .apply(pd.to_datetime)
df

    Text                                               Date
0   0 03/25/93 Total time of visit (in minutes):\n     1993-03-25
1   6/18/85 Primary Care Doctor:\n                     1985-06-18
2   sshe plans to move as of 7/8/71 In-Home Servic...  1971-07-08
3   7 on 9/27/75 Audit C Score Current:\n              1975-09-27
4   2/6/96 sleep studyPain Treatment Pain Level (N...  1996-02-06
5   .Per 7/06/79 Movement D/O note:\n                  1979-07-06
6   4, 5/18/78 Patient's thoughts about current su...  1978-05-18
7   10/24/89 CPT Code: 90801 - Psychiatric Diagnos...  1989-10-24
8   3/7/86 SOS-10 Total Score:\n                       1986-03-07
9   (4/10/71)Score-1Audit C Score Current:\n           1971-04-10
10  (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...  1985-05-11
11  4/09/75 SOS-10 Total Score:\n                      1975-04-09
12  8/01/98 Communication with referring physician...  1998-08-01
13  1/26/72 Communication with referring physician...  1972-01-26
14  5/24/1990 CPT Code: 90792: With medical servic...  1990-05-24
15  1/25/2011 CPT Code: 90792: With medical servic...  2011-01-25

str.extract returns a series of strings that looks like this:

array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79',
       '5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75',
       '8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object)

Regex details

(?P<Date>\d+(?:\/\d+){2})

  • (?P<Date>....) - named capture group
  • \d+ - one or more digits
  • (?:\/\d+){2} - non-capturing group repeated twice, where
    • \/ - escaped forward slash
    • {2} - repeater (two times)
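
The breakdown above can be checked outside pandas with Python's built-in re module, which accepts the same named-group syntax. This is only a small sketch: the helper name first_date is made up for illustration, the first three sample strings are copied from the question's data, and the last one is a hypothetical row added to show what the pattern does not match.

```python
import re

# Same pattern as in the answer.
pattern = re.compile(r"(?P<Date>\d+(?:\/\d+){2})")

samples = [
    "03/25/93 Total time of visit (in minutes):\n",
    "(4/10/71)Score-1Audit C Score Current:\n",
    "5/24/1990 CPT Code: 90792: With medical servic...",
    "10/2008 has only a month and a year",  # hypothetical row, not in the question
]

def first_date(text):
    """Return the text captured by the named group, or None if nothing matches."""
    m = pattern.search(text)
    return m.group("Date") if m else None

extracted = [first_date(s) for s in samples]
print(extracted)
```

The last sample yields None because the pattern insists on two slash-separated parts after the first number, so a month/year string is skipped entirely.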

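Regex for missing days

To handle optional days, a slightly modified regex is needed:

(?P<Date>(?:\d+\/)?\d+/\d+)

Details

  • (?P<Date>....) - named capture group
  • (?:\d+\/)? - nested (non-capturing) group; the trailing ? makes it optional
  • \d+ - one or more digits
  • \/ - escaped forward slash

The rest is the same. Substitute this regex for the current one; pd.to_datetime will handle the missing days.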
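
As a rough end-to-end sketch of this section, the snippet below uses the optional-day regex and then fills in a missing day with the 1st before parsing. It deliberately swaps the bare pd.to_datetime call for an explicit format string, on the assumption that format inference for two-digit years and month/year strings may differ between pandas versions. The helper to_timestamp, the Date_str column, and the last two sample rows are hypothetical; year-only dates are not covered here.

```python
import pandas as pd

# First two rows come from the question; the last two are made up to show
# a date without a day and a note without any date.
df = pd.DataFrame({"Text": [
    "3/7/86 SOS-10 Total Score:\n",
    "1/25/2011 CPT Code: 90792: With medical servic...",
    "seen again in 6/2008 for follow-up",
    "no date recorded",
]})

pattern = r"(?P<Date>(?:\d+\/)?\d+/\d+)"
raw = df["Text"].str.extract(pattern, expand=False)

def to_timestamp(s):
    """Parse m/d/yy, m/d/yyyy or m/yyyy; a missing day defaults to the 1st."""
    if pd.isna(s):
        return pd.NaT
    parts = s.split("/")
    if len(parts) == 2:  # month/year only: insert day 1
        parts = [parts[0], "1", parts[1]]
    fmt = "%m/%d/%Y" if len(parts[2]) == 4 else "%m/%d/%y"
    return pd.to_datetime("/".join(parts), format=fmt)

# The outer to_datetime call just guarantees a datetime64 column.
df["Date"] = pd.to_datetime(raw.apply(to_timestamp))

# Render as mm/dd/yyyy strings, the layout the question asks for.
df["Date_str"] = df["Date"].dt.strftime("%m/%d/%Y")
print(df[["Date", "Date_str"]])
```

Sorting is then a matter of calling df.sort_values("Date") on the datetime column rather than on the formatted strings.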