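Extracting certain date formats fails when using dateutil in Python

Time: 2017-09-14 13:11:56

Tags: python python-2.7 parsing string-parsing python-dateutil

I went through several links before posting this question, so please read it through. The following two answers already solved 90% of my problem: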

parse multiple dates using dateutil

How to parse multiple dates from a block of text in Python (or another language)

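Problem: I need to parse multiple dates, in multiple formats, in Python.

Solution from the links above: I was able to do this, but there are still some formats I cannot handle.

The formats that still fail to parse are: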

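  1. text = 'I want to visit from May 16-May 18'

  2. text = 'I want to visit from May 16-18'

  3. text = 'I want to visit from May 6 May 18'

I also tried regular expressions, but because the dates can come in any format I ruled that option out, as the code became very complex. So please suggest modifications to the code shown in the links so that the 3 formats above can be handled as well.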

1 answer:

Answer 0: (score: 1)

This kind of problem always needs tweaking for new edge cases, but the following approach is fairly robust:

from itertools import groupby, izip_longest
from datetime import datetime, timedelta
import calendar
import string
import re


def get_date_part(x):
    if x.lower() in month_list:
        return x

    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)

    if day:
        return day.group(1)

    return False


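def month_full(month):
    # Despite the name, this normalises a month to its abbreviation,
    # e.g. 'August' or 'aug' -> 'Aug'.
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except ValueError:
        return datetime.strptime(month, '%b').strftime('%b')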

tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
    ]

cur_year = 2017    

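# Lower-cased full and abbreviated month names (the empty entry at index 0 is dropped).
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
# Python 2 translation table that maps every punctuation character to a space.
remove_punc = string.maketrans(string.punctuation, ' ' * len(string.punctuation))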

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]

    days = []
    months = []
    years = []

    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)

        if k:
            months = map(month_full, values)
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)

        if days and months:
            if years:
                dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in izip_longest(months, days, years, fillvalue=years[0])]            
            else:
                dates_raw = [datetime.strptime('{} {}'.format(m, d), '%b %d').replace(year=cur_year) for m, d in izip_longest(months, days, fillvalue=months[0])]
                years = [cur_year]

            # Fix for jumps in year
            dates = []
            start_date = datetime(years[0], 1, 1)
            next_year = years[0] + 1

            for d in dates_raw:
                if d < start_date:
                    d = d.replace(year=next_year)
                    next_year += 1
                start_date = d
                dates.append(d)

            print "{}  ->  {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates))

This converts the test strings as follows:

I want to visit from May 16-May 18  ->  16/05/2017, 18/05/2017
I want to visit from May 16-18  ->  16/05/2017, 18/05/2017
I want to visit from May 6 May 18  ->  06/05/2017, 18/05/2017
May 6,7,8,9,10  ->  06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June  ->  08/05/2017, 10/06/2017
July 10/20/30  ->  10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please  ->  01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January  ->  02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan  ->  15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017  ->  01/11/2017
27th Oct 2010 until 1st jan  ->  27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012  ->  27/10/2010, 01/01/2012
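
The last few lines show the year roll-over at work: when a date would fall before the one preceding it, it is pushed into a later year. As a rough Python 3 sketch of that step on its own (the helper name `roll_years` is made up for illustration; the logic mirrors the "Fix for jumps in year" loop in the code above):

```python
from datetime import datetime


def roll_years(dates_raw, first_year):
    """Push any date that precedes its predecessor into a later year."""
    dates = []
    start_date = datetime(first_year, 1, 1)
    next_year = first_year + 1

    for d in dates_raw:
        if d < start_date:
            # Out of order: move it to the next unused year.
            d = d.replace(year=next_year)
            next_year += 1
        start_date = d
        dates.append(d)

    return dates


# '15 march, 10 feb, 5 jan', all initially placed in 2017
raw = [datetime(2017, 3, 15), datetime(2017, 2, 10), datetime(2017, 1, 5)]
print(', '.join(d.strftime('%d/%m/%Y') for d in roll_years(raw, 2017)))
# prints: 15/03/2017, 10/02/2018, 05/01/2019
```

Note that each out-of-order date consumes a new year, which is why the third date lands in 2019 rather than 2018.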

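It works as follows:

  1. First create a list of valid month names, both full and abbreviated.

  2. Build a translation table so that any punctuation in the text can be removed quickly.

  3. Split the text and keep only the date parts, using a function with a regular expression to spot days or months.

  4. Sort the list by whether each part is numeric; this groups the months at the front and the numbers at the end.

  5. Pair up the months and days (and years, if any), padding the shorter list with a fill value. Normalise each month name to a single form (the code uses the abbreviation, e.g. August becomes Aug) and convert each combination into a datetime object.

  6. If a date appears to fall before the previous date, add a whole year.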
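
For readers on Python 3, steps 1 to 5 could be sketched roughly as below. This is an illustrative port, not the answer's exact code. The names `MONTH_NUM`, `date_part`, `split_parts` and `pair_dates` are invented here. It looks month numbers up in a dict instead of round-tripping through `strptime`, it pads only the month and year lists, and it assumes an English locale for `calendar.month_name`. Step 6 (the year roll-over) is left out.

```python
import calendar
import re
import string
from datetime import datetime

# Step 1: map full and abbreviated month names (lower-cased) to month numbers.
MONTH_NUM = {}
for names in (calendar.month_name, calendar.month_abbr):
    for number, name in enumerate(names):
        if name:  # index 0 is an empty string
            MONTH_NUM[name.lower()] = number

# Step 2: translation table that turns every punctuation character into a space.
REMOVE_PUNC = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def date_part(token):
    """Step 3: return a month name or the leading digits of a token, else None."""
    if token.lower() in MONTH_NUM:
        return token.lower()
    match = re.match(r'(\d+)(\b|st|nd|rd|th)', token, re.I)
    return match.group(1) if match else None


def split_parts(text):
    """Steps 3-4: keep the date-like tokens and separate months, days, years."""
    parts = [date_part(t) for t in text.translate(REMOVE_PUNC).split()]
    parts = [p for p in parts if p]
    months = [MONTH_NUM[p] for p in parts if not p.isdigit()]
    numbers = [int(p) for p in parts if p.isdigit()]
    years = [n for n in numbers if 1900 <= n <= 2100]
    days = [n for n in numbers if not 1900 <= n <= 2100]
    return months, days, years


def pair_dates(text, cur_year):
    """Step 5: pair each day with a month and a year, padding the shorter lists."""
    months, days, years = split_parts(text)
    if not (months and days):
        return []
    # Assumption: missing months repeat the first month; missing years
    # repeat the first year found, or cur_year if the text has none.
    months += [months[0]] * (len(days) - len(months))
    years += [years[0] if years else cur_year] * (len(days) - len(years))
    return [datetime(y, m, d) for m, d, y in zip(months, days, years)]


print(pair_dates('I want to visit from May 16-18', 2017))
# prints: [datetime.datetime(2017, 5, 16, 0, 0), datetime.datetime(2017, 5, 18, 0, 0)]
```

As in the original, a stray number that is not a valid day of the month (for example 45) raises a ValueError when the datetime is built, so real input may need extra filtering.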