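Extracting multiple date formats with regular expressions in Python

Time: 2019-03-06 07:21:35

Tags: python regex python-3.x

I am trying to extract dates from text in Python. These are the possible texts and the date patterns in them.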

"Auction details: 14 December 2016, Pukekohe Park"
"Auction details: 17 Feb 2017, Gold Sacs Road"
"Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)"
"Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)"
"Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)"
"Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)"
"Auction details: Thursday, 28th February '19"
"Auction details: Friday, 1st February '19"

This is what I have written so far:

mon = ' (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?) '
day1 = r'\d{1,2}'
day_test = r'\d{1,2}(?:th)|\d{1,2}(?:st)' 
year1 = r'\d{4}'
year2 = r'\(\d{4}\)'
dummy = r'.*'

This captures cases 1 and 2.

import re

match = re.search(day1 + mon + year1, "Auction details: 14 December 2016, Pukekohe Park")
print(match.group())

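This sort of captures cases 3, 4 and 5. But it prints everything in the text, so in the case below I want 25 Nov 2016, but the regex pattern below gives me 25 Nov 3:00 p.m. (On Site)(2016).

So question 1: how do I get only the date here?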

match = re.search(day1 + mon + dummy + year2, "Friday 25 Nov 3:00 p.m. (On Site)(2016)")
print(match.group())

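Question 2: similarly, how do I capture cases 6, 7 and 8? What should the regex be?

If that is not possible, is there a better way to capture dates from these formats?

1 Answer:

Answer 0 (score: 3)

You may try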

((?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s+\d{1,2}(?:st|nd|rd|th)?|\d{1,2}(?:st|nd|rd|th)?\s+(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)))(?:.*(\b\d{2}(?:\d{2})?\b))?

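See the regex demo.

Note that I made all groups in the regex blocks non-capturing ((Nov|Dec) -> (?:Nov|Dec)), added an optional (?:st|nd|rd|th)? group after the day digit pattern, changed the year matching pattern to \b\d{2}(?:\d{2})?\b so that it only matches 4- or 2-digit chunks as whole words, and created an alternation group to account for dates where the day comes before the month and vice versa.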
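The individual changes listed above can be checked in isolation. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of the original demo.

```python
import re

# Capturing vs. non-capturing: re.findall returns the group's text when a
# capturing group is present, and the whole match otherwise.
capturing = re.findall(r'(Nov|Dec)(?:ember)?', "December")
non_capturing = re.findall(r'(?:Nov|Dec)(?:ember)?', "December")
print(capturing)      # ['Dec']
print(non_capturing)  # ['December']

# Day pattern with the optional ordinal suffix.
day = r'\d{1,2}(?:st|nd|rd|th)?'
days = re.findall(day, "28th 1st 3 22nd")
print(days)           # ['28th', '1st', '3', '22nd']

# Year pattern: only 2- or 4-digit numbers that stand as whole words.
year = r'\b\d{2}(?:\d{2})?\b'
years = re.findall(year, "(2016) '19 923 12345")
print(years)          # ['2016', '19']
```

The 3-digit and 5-digit numbers are skipped because the word boundaries do not allow a match that starts or ends inside a longer run of digits.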

The day and month are captured into Group 1 and the year is captured into Group 2, so the result is the concatenation of both.
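Because the year group is optional, Group 2 is None when no year is found. One way to join the two groups while tolerating a missing year is sketched below; the helper name extract_date is hypothetical and only wraps the pattern given in this answer.

```python
import re

mon = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)'
day = r'\d{1,2}(?:st|nd|rd|th)?'
year = r'\b\d{2}(?:\d{2})?\b'

# Group 1: "month day" or "day month"; Group 2 (optional): the year.
rx = re.compile(r"((?:{m}\s+{d}|{d}\s+{m}))(?:.*({y}))?".format(m=mon, d=day, y=year))

def extract_date(text):
    """Hypothetical helper: return 'day month [year]' or None if no date is found."""
    m = rx.search(text)
    if m is None:
        return None
    # Keep only the groups that actually matched, so a missing year is skipped.
    return " ".join(g for g in m.groups() if g)

print(extract_date("Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)"))  # 27 Apr 2016
print(extract_date("Auction details: Thursday, 28th February '19"))                 # 28th February 19
print(extract_date("Open home on 3 March"))                                         # 3 March
print(extract_date("no date here"))                                                 # None
```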

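Note: if you need to match the year more safely, you may want to make the year pattern more precise. For example, to avoid matching a 4- or 2-digit whole word that comes right after a colon (:), add a negative lookbehind:

year1 = r'\b(?<!:)\d{2}(?:\d{2})?\b'
            ^^^^^^

Also, you can add word boundaries around the whole pattern to ensure a whole-word match.

Here is the Python demo: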

import re
mon = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)'
day1 = r'\d{1,2}(?:st|nd|rd|th)?'
year1 = r'\b\d{2}(?:\d{2})?\b'
dummy = r'.*'

rx = r"((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
# Or, try this if a partial number before a date is parsed as day:
# rx = r"\b((?:{smon}\s+{sday1}|{sday1}\s+{smon}))(?:{sdummy}({syear1}))?".format(smon=mon, sday1=day1, sdummy=dummy, syear1=year1)
strs = [
    "Auction details: 14 December 2016, Pukekohe Park",
    "Auction details: 17 Feb 2017, Gold Sacs Road",
    "Auction details: Wednesday 27 Apr 1:00 p.m. (On site)(2016)",
    "Auction details: Wednesday 27 Apr 1:00 p.m. (In Rooms - 923 Whangaa Rd, Man)(2016)",
    "Auction details: Wed 27 Apr 2:00 p.m., 48 Viaduct Harbour Ave, Auckland, (2016)",
    "Auction details: November 16 Wednesday 2:00pm at 48 Viaduct Harbour Ave, Auckland(2016)",
    "Auction details: Thursday, 28th February '19",
    "Auction details: Friday, 1st February '19",
    "Friday 25 Nov 3:00 p.m. (On Site)(2016)",
]
for s in strs:
    print(s)
    m = re.search(rx, s)
    if m:
        print("{} {}".format(m.group(1), m.group(2)))
    else:
        print("NO MATCH")
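
The effect of the negative lookbehind from the note can be seen on a string that has a time but no year. This is a small sketch under that assumption; build_pattern and the year-less sample string are hypothetical and not part of the original demo.

```python
import re

MON = r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)'
DAY = r'\d{1,2}(?:st|nd|rd|th)?'

def build_pattern(year):
    """Hypothetical helper: compile the answer's pattern with a given year sub-pattern."""
    return re.compile(r"((?:{m}\s+{d}|{d}\s+{m}))(?:.*({y}))?".format(m=MON, d=DAY, y=year))

plain = build_pattern(r'\b\d{2}(?:\d{2})?\b')          # year pattern from the demo
guarded = build_pattern(r'\b(?<!:)\d{2}(?:\d{2})?\b')  # variant with the lookbehind

no_year = "Friday 25 Nov 3:00 p.m."
print(plain.search(no_year).groups())      # ('25 Nov', '00'): minutes taken as a year
print(guarded.search(no_year).groups())    # ('25 Nov', None)

with_year = "Friday 25 Nov 3:00 p.m. (On Site)(2016)"
print(guarded.search(with_year).groups())  # ('25 Nov', '2016')
```

Without the lookbehind, the minutes of the time are picked up as a 2-digit year whenever no real year follows; with it, Group 2 is simply left unset.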