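Regular expression for capturing dates with a specific pattern

Time: 2019-05-23 21:29:45

Tags: python regex regex-lookarounds regex-group regex-greedy

I am trying to extract data from multiple PDFs. One data point is a date, and the string that precedes the date differs between some of the PDFs. I have checked that each individual regex statement works, but when I try to combine them into a single statement for use in a for loop, no date is extracted. Below are the strings I want to match, each followed by the regex statement that extracts the date after "DATE OF BIRTHDAY":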

DATE OF BIRTHDAY\n01/11/2011
date_of_birthday1 = re.search('(?<=DATE OF BIRTHDAY \\n)(.*)', img).groups()

DATE OF BIRTHDAY\n\n02/14/2015
date_of_birthday2 = re.search('(?<=DATE OF BIRTHDAY \\n\\n)(.*)', img).groups()

DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018
date_of_birthday3 = re.search('(?<=DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll)(.*)', img).groups()

I am trying to combine these regex statements with an "or" so that I can use them in a for loop, like this:

date_of_birthdays = re.search('(?<=DATE OF BIRTHDAY\\n\\n)(.*)|(?<=DATE OF BIRTHDAY\\n)(.*)|(?<=DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll)(.*)', img).groups

My expected output is

df['Birthdays'] = date_of_birthdays

which should look like this:

df = pd.DataFrame({"Birthdays": ['01/11/2011', '02/14/2015', '05/07/2018']})
df

However, I am not able to extract any date information. Any ideas about what I am doing wrong here?

1 Answer:

Answer 0 (score: 1)

This works

>>> import re
>>> re.findall(
...  r"(?:DATE[ ]OF[ ]BIRTHDAY)(?:\\n(?:\\n)?|[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])?(.*)",
...  (
...  r'DATE OF BIRTHDAY\n01/11/2011' + "\n"
...  r'DATE OF BIRTHDAY\n\n02/14/2015' + "\n"
...  r'DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018' + "\n"
...  ))
['01/11/2011', '02/14/2015', '05/07/2018']
>>>
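
The sample above uses raw strings, so each \n in it is a literal backslash followed by n. If the text extracted from the PDFs contains real newline characters instead, the same idea can be applied per document in a loop. The following is a minimal sketch under that assumption; the names DOB_PATTERN, extract_birthdays, and texts are hypothetical and not part of the question.

```python
import re

# Same structure as the pattern above, but written for real newline characters.
# In a raw string, \n reaches the regex engine, which reads it as a newline.
DOB_PATTERN = re.compile(
    r"DATE[ ]OF[ ]BIRTHDAY"
    r"(?:\n\n?|[ ]GIRL[ ]\n\ni[ ]:[ ]Pll[ ]i[ ]ii\ni[ ]\n\nPll[ ])?"
    r"(.*)"
)

def extract_birthdays(texts):
    """Return capture group 1 for each text, or None when the label is absent."""
    results = []
    for text in texts:
        match = DOB_PATTERN.search(text)
        results.append(match.group(1) if match else None)
    return results

# Hypothetical per-PDF texts, modeled on the three samples in the question.
texts = [
    "DATE OF BIRTHDAY\n01/11/2011",
    "DATE OF BIRTHDAY\n\n02/14/2015",
    "DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018",
]
birthdays = extract_birthdays(texts)
print(birthdays)
```

The resulting list could then be assigned to a DataFrame column such as df['Birthdays'], provided the frame has one row per document.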

Regex expanded

 (?: DATE [ ] OF [ ] BIRTHDAY )

 (?:
      \\ n 
      (?: \\ n )?
   |  [ ] GIRL [ ] \\ n \\ ni [ ] : [ ] Pll [ ] i [ ] ii \\ n i [ ] \\ n \\ n Pll [ ] 
 )?
 ( .* )                        # (1)
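
The expanded layout can be passed to Python almost as written by compiling it with re.VERBOSE, which ignores unescaped whitespace and # comments inside the pattern. That is also why the spaces are written as [ ]. A small sketch, with the variable names pattern and sample chosen here for illustration:

```python
import re

# re.VERBOSE skips whitespace between tokens, so "\\ n" is read as "\\n":
# an escaped backslash followed by the letter n.
pattern = re.compile(
    r"""
    (?: DATE [ ] OF [ ] BIRTHDAY )
    (?:
          \\ n
          (?: \\ n )?
       |  [ ] GIRL [ ] \\ n \\ n i [ ] : [ ] Pll [ ] i [ ] ii \\ n i [ ] \\ n \\ n Pll [ ]
    )?
    ( .* )                        # (1)
    """,
    re.VERBOSE,
)

# Raw strings keep \n as two literal characters; the lines are joined
# with real newlines, as in the session above.
sample = "\n".join([
    r"DATE OF BIRTHDAY\n01/11/2011",
    r"DATE OF BIRTHDAY\n\n02/14/2015",
    r"DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018",
])
found = pattern.findall(sample)
print(found)
```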

Just a fair warning: the expression written with lookbehind assertions
has a problem in these two alternatives:

   (?<= DATE [ ] OF [ ] BIRTHDAY \\ n \\ n )
   ( .* )                        # (1)
|  (?<= DATE [ ] OF [ ] BIRTHDAY \\ n )
   ( .* )                        # (2)  

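It is hard to visualize, so I will just say it:
capture group 1 (the first alternative) will never match!

The reason is that the engine scans left to right, so the position that satisfies the shorter lookbehind is always reached first.
Because .* can match from that position, the alternative with only a single \n
literal always wins.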
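
This can be checked with a short experiment. The sketch below assumes, as the answer's sample does, that the text contains literal backslash-n sequences; the names naive, text, and match are illustrative.

```python
import re

# The two lookbehind alternatives from the warning above, unchanged.
naive = re.compile(
    r"(?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)(.*)"
    r"|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(.*)"
)

# A raw string, so \n here is a backslash followed by n.
text = r"DATE OF BIRTHDAY\n\n02/14/2015"
match = naive.search(text)

# The position right after the first \n is tried before the position
# after the second one, so only the second alternative takes part.
print(match.group(1))
print(match.group(2))
```

Group 1 comes back as None, and group 2 holds the date with the second \n still attached in front of it.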

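You can fix this by adding a negative lookahead (?!\\n), which forces the second alternative not to match when another \n follows, like this:
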
   (?<= DATE [ ] OF [ ] BIRTHDAY \\ n \\ n )
   ( .* )                        # (1)
|  (?<= DATE [ ] OF [ ] BIRTHDAY \\ n )
   (?! \\ n )
   ( .* )                        # (2)  
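
A quick check of the guarded version, again assuming literal backslash-n sequences in the text. The names fixed, single, and double are illustrative only.

```python
import re

# Same two alternatives, with (?!\\n) added to the second one.
fixed = re.compile(
    r"(?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)(.*)"
    r"|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)(.*)"
)

single = fixed.search(r"DATE OF BIRTHDAY\n01/11/2011")
double = fixed.search(r"DATE OF BIRTHDAY\n\n02/14/2015")

# Exactly one group participates in each match; the other is None.
print(single.groups())
print(double.groups())

# match.lastindex gives the index of the group that matched,
# which avoids checking each group by hand.
dates = [m.group(m.lastindex) for m in (single, double)]
print(dates)
```

With the guard in place, the double \n case is handled by group 1 and the single \n case by group 2.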

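That may not have been clear, so here are
some benchmarks of the approaches under consideration (the lookbehind approach is not really the ideal one)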

Regex1:   (?:DATE[ ]OF[ ]BIRTHDAY)(?:\\n(?:\\n)?|[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])?(.*)
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   3
Elapsed Time:    0.29 s,   294.80 ms,   294801 µs
Matches per sec:   508,817


Regex2:   (?:(?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ]))(.*)
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   3
Elapsed Time:    2.27 s,   2268.42 ms,   2268417 µs
Matches per sec:   66,125


Regex3:   (?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)(.*)|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)(.*)|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])(.*)
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   3
Elapsed Time:    2.76 s,   2760.81 ms,   2760809 µs
Matches per sec:   54,331
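
A rough way to repeat this comparison in Python is the standard timeit module. This is only a sketch: Python's re engine may behave differently from whatever tool produced the figures above, and the timings depend on the machine, so no numbers are claimed here. The names REGEXES, sample, dates, and results are illustrative.

```python
import re
import timeit

# The three expressions, copied from the benchmark listing above.
REGEXES = {
    "Regex1": r"(?:DATE[ ]OF[ ]BIRTHDAY)(?:\\n(?:\\n)?"
              r"|[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])?(.*)",
    "Regex2": r"(?:(?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)"
              r"|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)"
              r"|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ]))(.*)",
    "Regex3": r"(?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)(.*)"
              r"|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)(.*)"
              r"|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])(.*)",
}

# Three sample lines with literal backslash-n, joined by real newlines.
sample = "\n".join([
    r"DATE OF BIRTHDAY\n01/11/2011",
    r"DATE OF BIRTHDAY\n\n02/14/2015",
    r"DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018",
])

def dates(compiled, text):
    """Collect whichever capture group took part in each match."""
    return [m.group(m.lastindex) for m in compiled.finditer(text)]

results = {}
for name, source in REGEXES.items():
    compiled = re.compile(source)
    results[name] = dates(compiled, sample)
    seconds = timeit.timeit(lambda: dates(compiled, sample), number=1000)
    print(f"{name}: {seconds:.4f} s for 1000 runs, {len(results[name])} matches")
```

All three expressions should return the same three dates; only the elapsed time is expected to differ.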