Remove a date substring based on a mask

Time: 2020-11-02 14:19:58

Tags: python regex

I have the following text:

Filling a gap December 6, 2018 Slide 6 Small parts example. Padded details May 22, 2020 Slide 21 Adds to safety

I need to replace each date + slide marker with a . (dot) to get the following result:

Filling a gap. Small parts example. Padded details. Adds to safety

Perhaps this mask could be used to identify the text to remove:

{month} {day}, {year} {Slide} {slide number}

I can match the month with a regular expression like this:

(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)

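But how do I define the mask and put everything together? I am not sure whether a regular expression is the right solution here or whether it is overkill.

2 answers:

Answer 0 (score: 3)

You could match the day as a number from 1 to 31 to make the pattern a bit more specific, followed by Slide and 1 or more digits.

If you also match the whitespace before and after the date, and replace the match with a dot and a single space, you avoid leaving double spaces behind.

Replace matches of the following pattern with ". " (a dot followed by a space):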

\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \b(?:[1-9]|[12]\d|3[01])\b,\s+\d{4} Slide \d+\s*


import re

# Optional whitespace, month name, day 1-31, comma, 4-digit year, "Slide" and its number
pattern = r"\s*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?) \b(?:[1-9]|[12]\d|3[01])\b,\s+\d{4} Slide \d+\s*"
s = "Filling a gap December 6, 2018 Slide 6 Small parts example. Padded details May 22, 2020 Slide 21 Adds to safety"
# Replace each date + slide marker with a dot and a single space
print(re.sub(pattern, ". ", s))

Output

Filling a gap. Small parts example. Padded details. Adds to safety
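
To tie this back to the mask in the question, the same pattern can also be assembled from one named piece per placeholder. This is only a sketch: the names MONTH, DAY, YEAR, SLIDE, MASK and strip_date_slides are made up for illustration and do not come from the question or from any library.

```python
import re

# Hypothetical building blocks, one per placeholder of the mask
# {month} {day}, {year} {Slide} {slide number}
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
DAY = r"(?:[1-9]|[12]\d|3[01])"  # 1 to 31, without a leading zero
YEAR = r"\d{4}"
SLIDE = r"Slide \d+"

# The surrounding \s* swallows the spaces around the date so that the
# replacement ". " does not leave double spaces behind.
MASK = re.compile(rf"\s*{MONTH} \b{DAY}\b,\s+{YEAR} {SLIDE}\s*")


def strip_date_slides(text: str) -> str:
    """Replace every '<month> <day>, <year> Slide <n>' marker with '. '."""
    return MASK.sub(". ", text)


s = "Filling a gap December 6, 2018 Slide 6 Small parts example. Padded details May 22, 2020 Slide 21 Adds to safety"
print(strip_date_slides(s))
# Filling a gap. Small parts example. Padded details. Adds to safety
```

Compiling the pattern once is mainly a convenience if the substitution runs over many strings. Note that a marker at the very end of the text would leave a trailing ". ", so the result may need an extra strip in that case.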

Answer 1 (score: 1)

Try this:

(?:\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}\D?)?\D?(?:(?:19[7-9]\d|20\d{2})|\d{2}) Slide \d+
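
A possible way to use this pattern with re.sub is sketched below. The pattern itself does not consume the whitespace around the date, so this sketch wraps it in \s* on both sides (an addition that is not part of the pattern above) and replaces each match with ". ". The names LOOSE_DATE_SLIDE, WRAPPED and strip_with_loose_pattern are hypothetical.

```python
import re

# The pattern from this answer, copied unchanged
LOOSE_DATE_SLIDE = (
    r"(?:\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|"
    r"Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)"
    r"\D?(?:\d{1,2}\D?)?\D?(?:(?:19[7-9]\d|20\d{2})|\d{2}) Slide \d+"
)

# Assumption: surrounding whitespace is matched as well, so the
# replacement ". " does not leave a space in front of the dot.
WRAPPED = re.compile(r"\s*" + LOOSE_DATE_SLIDE + r"\s*")


def strip_with_loose_pattern(text: str) -> str:
    """Replace each date + slide marker with a dot and a single space."""
    return WRAPPED.sub(". ", text)


s = "Filling a gap December 6, 2018 Slide 6 Small parts example. Padded details May 22, 2020 Slide 21 Adds to safety"
print(strip_with_loose_pattern(s))
# Filling a gap. Small parts example. Padded details. Adds to safety

# This pattern is looser than the mask in the question: it also accepts
# a day placed before the month.
print(strip_with_loose_pattern("Intro 6 December 2018 Slide 6 Next"))
# Intro. Next
```

Whether the looser matching is desirable depends on the input; for text that always follows the mask exactly, the stricter pattern in the other answer is less likely to remove something unintended.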
