Matching multiple date formats and converting them to one with Python regex

Time: 2016-03-31 01:22:50

Tags: python regex datetime format

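I need to implement a function in Python that can retrieve several date formats from an input string, convert them to one specific format, and return only the date: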

Format            Example Input String  

MMDDYYYY          foo.bar.02242015.txt
MMDDYY            foo.bar.022415.txt
MONCCYY           foo.bar.FEB2015.txt
YYYY-MM-DD        foo_bar_2015-02-01_2015-02-28.txt
YYYYMMDD          foo_bar_20150224.txt
MM_YY             foo_bar_02_15.txt
YYYYMMDD          foo_bar_20150224.txt

Output: a single fixed 8-digit date format (without foo, bar or txt):

YYYYMMDD (e.g. 20120524)

Example:

Input                     Output
foo.bar.02242015.txt  ->  20150224  

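Some requirements:

  1. If the day is missing, use the last day of that month:
    foo_02_15.txt -> 20150228
  2. If the year has 2 digits, expand it to 4:
    foo_02_24_16.txt -> 20160224
  3. Valid years are the current and the previous year, at present 2016 or 2015.
  4. If the month is not numeric, e.g. FEB, convert it to 2 digits:
    foo.FEB2015.txt -> 20150228
  5. The format 'YYYY-MM-DD' always contains two dates; take the second one:
    foo_2015-02-01_2015-02-28.txt -> 20150228

Does anyone know how to do this in Python with regex? Or what would be the best practice?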

2 answers:

Answer 0 (score: 0)

UPDATE2: Try the following approach (Python 2.7):

import re
import calendar

INPUT = ['foo.bar.02242015.txt',
        'foo.bar.022415.txt',
        'foo.bar.FEB2015.txt',
        'foo_bar_2015-02-01_2015-02-28.txt',
        'foo_bar_20150224.txt',
        'foo_bar_02_15.txt',
        'foo_bar_20150224.txt' ]
P1 = r'(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])((?:19|20)?\d{2})'
P2 = r'[A-Z]{3}[12]\d{3}|[12]\d{3}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])_?|(?:0[1-9]|1[0-2])_[12]\d'
MONTHS = ['JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC']

def StrFormat(date_string):
    m2 = re.findall(P2, date_string)
    if m2:
        for m in m2:
            if len(m) == 5:
                month, year = m.split('_')[0], '20' +  m.split('_')[1]
                last_day = calendar.monthrange(int(year), int(month))[1]
                date_string = re.sub(P2, year+month+ str(last_day), date_string, 1)
            elif len(m) == 7:
                month, year = str(MONTHS.index(m[0:3]) + 1).zfill(2), m[3:]
                last_day = calendar.monthrange(int(year), int(month))[1]
                date_string = re.sub(P2, year+month+ str(last_day), date_string, 1)
            elif len(m) == 10:
                date_string =  re.sub(P2, m.replace('-', ''), date_string, 1)
            elif len(m) > 5:
                date_string =  re.sub(P2, '', date_string, 1)

    m1 = re.findall(P1, date_string)
    if m1:
        for m in m1:
            if len(m[2]) == 2:
                date_string = re.sub(P1, r'20\3\1\2', date_string, 1)
            elif len(m[2]) == 4:
                date_string = re.sub(P1, r'\3\1\2', date_string, 1)
            elif len(m) > 2:
                date_string = re.sub(P1, '', date_string, 1)
    return date_string


for i in INPUT:
    # Formatting into one string gives the same output on Python 2.7 and Python 3
    print('%s -> %s' % (i.ljust(35), StrFormat(i).rjust(20)))

Output:

foo.bar.02242015.txt                -> foo.bar.20150224.txt
foo.bar.022415.txt                  -> foo.bar.20150224.txt
foo.bar.FEB2015.txt                 -> foo.bar.20150228.txt
foo_bar_2015-02-01_2015-02-28.txt   -> foo_bar_20150228.txt
foo_bar_20150224.txt                -> foo_bar_20150224.txt
foo_bar_02_15.txt                   -> foo_bar_20150228.txt
foo_bar_20150224.txt                -> foo_bar_20150224.txt

By the way: as noob suggested, this is 10% regex + 90% programming :-)
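
The output above still carries the file name around the date, and the year rule from the question (current or previous year only) is not checked. A minimal Python 3 sketch of that last step, assuming the string has already been normalized so that the date appears as a standalone 8-digit run. The helper name `extract_date` and its `today` parameter are hypothetical and not part of the answer:

```python
import re
from datetime import date

def extract_date(normalized, today=None):
    """Return the first valid YYYYMMDD token, or None (hypothetical helper)."""
    today = today or date.today()
    # Requirement 3: only the current and the previous year are valid.
    valid_years = {today.year, today.year - 1}
    # Look for an 8-digit run that is not part of a longer digit run.
    for token in re.findall(r'(?<!\d)\d{8}(?!\d)', normalized):
        if int(token[:4]) in valid_years:
            return token
    return None

print(extract_date('foo.bar.20150228.txt', today=date(2016, 3, 31)))  # 20150228
```

Passing `today` explicitly keeps the year check reproducible; without it the helper falls back to the current date.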

Answer 1 (score: 0)

Try this:

import re
import time
import datetime
import calendar

p = re.compile(r'(?<=\.|_)([A-Z\d+_-]*?([A-Z\d+_-]{0,10}))(?=\.)')
test_str = u"Format            Example Input String  \n\nMMDDYYYY          foo.bar.02242015.txt\nMMDDYY            foo.bar.022415.txt\nMONCCYY           foo.bar.FEB2015.txt\nYYYY-MM-DD        foo_bar_2015-02-01_2015-02-28.txt\nYYYYMMDD          foo_bar_20150224.txt\nMM_YY             foo_bar_02_15.txt\nYYYYMMDD          foo_bar_20150224.txt"
def changedate(date):
    try:
        t = time.strptime(date,'%m%d%Y')
    except:
        pass
    try:
        t = time.strptime(date,'%m%d%y')
    except:
        pass
    try:
        t = time.strptime(date,'%b%Y')
        lastday = calendar.monthrange(int(t.tm_year), int(t.tm_mon))[1]
        t = time.strptime(date + str(lastday),'%b%Y%d')
    except:
        pass
    try:
        t = time.strptime(date,'%m_%y')
        lastday = calendar.monthrange(int(t.tm_year), int(t.tm_mon))[1]
        t = time.strptime(date + str(lastday),'%m_%y%d')
    except:
        pass        
    try:
        t = time.strptime(date,'%Y-%m-%d')
    except:
        pass
    try:
        r = time.strftime("%Y%m%d",t)
        return r
    except:
        pass
    return date
test_str = re.sub(p,lambda m: changedate(m.group(2)), test_str)
print(test_str)


Input:

Format            Example Input String  

MMDDYYYY          foo.bar.02242015.txt
MMDDYY            foo.bar.022415.txt
MONCCYY           foo.bar.FEB2015.txt
YYYY-MM-DD        foo_bar_2015-02-01_2015-02-28.txt
YYYYMMDD          foo_bar_20150224.txt
MM_YY             foo_bar_02_15.txt
YYYYMMDD          foo_bar_20150224.txt

Output:

Format            Example Input String  

MMDDYYYY          foo.bar.20150224.txt
MMDDYY            foo.bar.20150224.txt
MONCCYY           foo.bar.20150228.txt
YYYY-MM-DD        foo_bar_20150228.txt
YYYYMMDD          foo_bar_20150224.txt
MM_YY             foo_bar_20150228.txt
YYYYMMDD          foo_bar_20150224.txt

Explanation

For example, given the input

foo_bar_2015-02-01_2015-02-28.txt

the pattern

(?<=\.|_)([A-Z\d+_-]*?([A-Z\d+_-]{0,10}))(?=\.)

captures the date string into the groups of the match object m:

1.  [182-203]   `2015-02-01_2015-02-28`
2.  [193-203]   `2015-02-28`    

m.group(0) = 2015-02-01_2015-02-28
m.group(1) = 2015-02-01_2015-02-28
m.group(2) = 2015-02-28

Then lambda m: changedate(m.group(2)) reformats the date held in group 2.
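
The group contents listed above and the effect of passing a callable to re.sub can be checked with a short Python 3 sketch. The function `strip_dashes` is a hypothetical stand-in for changedate that only removes the dashes, so the snippet stays self-contained:

```python
import re

p = re.compile(r'(?<=\.|_)([A-Z\d+_-]*?([A-Z\d+_-]{0,10}))(?=\.)')
name = 'foo_bar_2015-02-01_2015-02-28.txt'

m = p.search(name)
print(m.group(1))  # 2015-02-01_2015-02-28
print(m.group(2))  # 2015-02-28

def strip_dashes(match):
    # Hypothetical stand-in for changedate: it only removes the dashes.
    return match.group(2).replace('-', '')

# re.sub replaces the whole match (group 0) with the callable's return value,
# so the first date disappears and only the reformatted second date remains.
result = p.sub(strip_dashes, name)
print(result)  # foo_bar_20150228.txt
```

Because group 2 is limited to 10 characters and the outer part is lazy, group 2 ends up holding the last date before the file extension.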

So 2015-02-28 fails the other try blocks, such as

    try:
        t = time.strptime(date,'%m%d%Y')
    except:
        pass

and only passes this block:

    try:
        t = time.strptime(date,'%Y-%m-%d')
    except:
        pass

The parsed value is then formatted:

    try:
        r = time.strftime("%Y%m%d",t)
        return r
    except:
        pass
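
The chain of try blocks depends on bare except clauses, and when nothing parses it relies on the final block failing because t was never assigned. A possible Python 3 restructuring is sketched below, assuming an English locale for %b. The names `FORMATS` and `change_date` are hypothetical. Unlike the original, the first layout that fits wins instead of the last; for the sample inputs this gives the same results:

```python
import calendar
import time

# (format, has_day) pairs mirroring the layouts tried in changedate.
FORMATS = [
    ('%m%d%Y', True),
    ('%m%d%y', True),
    ('%b%Y', False),
    ('%m_%y', False),
    ('%Y-%m-%d', True),
]

def change_date(text):
    """Return text as YYYYMMDD, or unchanged if no layout fits (hypothetical)."""
    for fmt, has_day in FORMATS:
        try:
            t = time.strptime(text, fmt)
        except ValueError:
            continue  # this layout does not fit, try the next one
        # Layouts without a day fall back to the last day of the month.
        day = t.tm_mday if has_day else calendar.monthrange(t.tm_year, t.tm_mon)[1]
        return '%04d%02d%02d' % (t.tm_year, t.tm_mon, day)
    return text

for sample in ('02242015', '022415', 'FEB2015', '02_15', '2015-02-28', '20150224'):
    print(sample, '->', change_date(sample))
```

Catching only ValueError keeps unrelated errors visible, and the explicit `return text` makes the pass-through for strings that are already in YYYYMMDD form intentional.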