Extract date from string with irregular structure

Time: 2018-12-19 19:23:14

Tags: python regex python-3.x datetime

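I am trying to extract date information from a string. The string can look like any of these:

  1. 5 months and 17 hours
  2. 1 month and 19 days
  3. 3 months and 1 day
  4. 2 years 1 month and 2 days
  5. 1 year 1 month and 1 days and 1 hour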

I would like to extract:

  1. y = 0 m = 5 d = 0 h = 17
  2. y = 0 m = 1 d = 19 h = 0
  3. y = 0 m = 3 d = 1 h = 0
  4. y = 2 m = 1 d = 2 h = 0
  5. y = 1 m = 1 d = 1 h = 1

I started out like this:

publishedWhen = '1 year 1 month and 1 days and 1 hour'

y,m,d,h = 0,0,0,0

if 'day ' in publishedWhen:
    d = int(publishedWhen.split(' day ')[0])

if 'days ' in publishedWhen:
    d = int(publishedWhen.split(' days ')[0])

if 'days ' not in publishedWhen and 'day ' not in publishedWhen:
    d = 0

if 'month ' in publishedWhen:
    m = int(publishedWhen.split(' month ')[0])
    d = int(publishedWhen.replace(publishedWhen.split(' month ')[0] + ' month ','').replace('and','').replace('days','').replace('day',''))

if 'months ' in publishedWhen:
    m = int(publishedWhen.split(' months ')[0])

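However, I know this code has many bugs (some cases are probably not handled), and a regular expression would probably give something cleaner and more effective. Is that true? Which regular expression would help me extract all of this information?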

1 answer:

Answer 0 (score: 5)

You don't have to use regular expressions; you could instead look at the very rich library of third-party packages on the Python Package Index.

For example, you could use a combination of dateparser (for parsing human-readable dates) and dateutil (for the relative delta object):

from datetime import datetime

import dateparser
from dateutil.relativedelta import relativedelta


BASE_DATE = datetime(2018, 1, 1)


def get_relative_date(date_string):
    parsed_date = dateparser.parse(date_string, settings={"RELATIVE_BASE": BASE_DATE})
    return relativedelta(parsed_date, BASE_DATE)


date_strings = [
    "5 months and 17 hours",
    "1 month and 19 days",
    "3 months and 1 day",
    "2 years 1 month and 2 days",
    "1 year 1 month and 1 days and 1 hour"
]

for date_string in date_strings:
    delta = get_relative_date(date_string)
    print(f"y={abs(delta.years)} m={abs(delta.months)} d={abs(delta.days)} h={abs(delta.hours)}")

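I don't particularly like having to use a base date to compute the delta, and I'm fairly sure there is a package that can parse directly into a delta object. Open to any suggestions.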
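
For comparison, here is a minimal regex-only sketch that needs no base date and uses only the standard library. It assumes the input always follows the "<number> <unit>" shape shown in the question, with units limited to year, month, day and hour. The names `UNIT_PATTERN` and `parse_duration` are made up for illustration and are not part of dateparser or dateutil.

```python
import re

# Matches "<number> <unit>", where the unit is year, month, day or hour,
# optionally followed by a plural "s". Filler words such as "and" are skipped
# because findall only returns the parts that match.
UNIT_PATTERN = re.compile(r"(\d+)\s*(year|month|day|hour)s?\b", re.IGNORECASE)


def parse_duration(text):
    """Return a dict with keys y, m, d, h; units absent from the text stay 0."""
    result = {"y": 0, "m": 0, "d": 0, "h": 0}
    for amount, unit in UNIT_PATTERN.findall(text):
        # The first letter of the unit name doubles as the result key.
        result[unit[0].lower()] += int(amount)
    return result


print(parse_duration("1 year 1 month and 1 days and 1 hour"))
# {'y': 1, 'm': 1, 'd': 1, 'h': 1}
```

This approach only understands the four units listed in the pattern and digits written as numerals, so anything like "a week" or "two days" is silently ignored. A parsing library handles that kind of input far better.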