Question

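I just built this expression in the regex101 editor, but I accidentally forgot to switch it to the Python flavor. I am not familiar with the differences between the flavors, but I assumed they were fairly minor. They are not.

The Perl/PCRE flavor takes 99.89% fewer steps than the Python flavor (6,565 vs. 6,377,715 steps).

The regex: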

^(\d{1,3}) +((?:[a-zA-Z0-9\(\)\-≠,]+ )+) +£ *((?:[\d]  {1,4}|\d)+)∑([ \d]+)?

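Any help would be greatly appreciated! Thanks.

Edit

The data source is multi-line text extracted from a PDF, which makes the output imperfect (you can see the base source PDF here).

I am trying to extract the box number, the title, and any number that is present (filled in) for specific lines. If you follow the link above, you can see a complete example. For example:

Below is a screenshot from Regex101 showing the positive matches. The topmost match shows the box number (155), the title (Trading profits) and the number (5561).

Constraints:

Ideally, the values are extracted as they appear in the PCRE compiler, with little or no extra whitespace before or after the match: just the box number, title and value.
Only match when a number/value is filled in (e.g. 5561 in the example above; so do not match the line immediately after it, box 160, but do match box 165).
The format changes further down the form. I have a separate regex for that part, so please ignore it.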
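To make the constraints concrete, here is a minimal sketch that runs the pattern from the question with Python's standard `re` module. The two sample lines are hypothetical: their spacing is only a guess at what the PDF extraction produces, and the title of the unfilled box is invented for illustration.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(
    r"^(\d{1,3}) +((?:[a-zA-Z0-9\(\)\-≠,]+ )+) +£ *((?:[\d]  {1,4}|\d)+)∑([ \d]+)?",
    re.M,
)

# Hypothetical input: the spacing and the second title are assumptions,
# not copied from the real PDF text.
SAMPLE = (
    "155 Trading profits  £ 5  5  6  1  ∑ 0  0\n"
    "160 Some unfilled box  £  ∑\n"
)


def extract_rows(text):
    """Return the stripped groups of every line that has a filled-in value."""
    rows = []
    for m in PATTERN.finditer(text):
        # The optional last group is None when nothing follows the ∑ sign.
        rows.append([(g or "").strip() for g in m.groups()])
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```

Only the first line is returned, because the pattern requires at least one digit between `£` and `∑`. Stripping each group afterwards removes the padding that the captures still carry.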

Answer 1

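Proposal: use the newer regex module, which supports atomic groups and possessive quantifiers. Compared with the initial PCRE expression, this cuts the number of steps by roughly 50% (see a demo on regex101.com):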

^
(\d{1,3})\s++
((?>[^£\n]+))£\s++
([ \d]+)(?>[^∑\n]+)∑\s++
([ \d]+)
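A small sketch of what the possessive quantifier (`++`) and the atomic group (`(?>...)`) in this pattern do: once they have matched, they never give characters back, so the engine has fewer alternatives to retry. The input string below is made up for illustration. The fallback import assumes Python 3.11 or later, where the standard `re` module also accepts this syntax.

```python
try:
    import regex as engine  # third-party module named in the answer
except ImportError:
    import re as engine  # stdlib fallback; ++ and (?>...) need Python 3.11+

text = "12345"  # made-up input

# Greedy: \d+ first takes all five digits, then backtracks one so "5" can match.
greedy = engine.fullmatch(r"\d+5", text)

# Possessive: \d++ keeps all five digits, so the literal "5" can never match.
possessive = engine.fullmatch(r"\d++5", text)

# Atomic group: the same behaviour written with (?>...).
atomic = engine.fullmatch(r"(?>\d+)5", text)

print(greedy is not None, possessive, atomic)
```

The greedy version succeeds only because it is allowed to backtrack. The other two fail immediately, and that early failure is what saves steps on lines that cannot match.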

To put this into practice, you could write:

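import regex as re  # third-party module (pip install regex), aliased to mirror the stdlib name

# data holds the multi-line text extracted from the PDF (not shown here)
rx = re.compile(r'''
    ^
    (\d{1,3})\s++
    ((?>[^£\n]+))£\s++
    ([ \d]+)(?>[^∑\n]+)∑\s++
    ([ \d]+)''', re.M | re.X)

# Strip the surrounding padding from every captured group of every match
matches = [[group.strip() for group in m.groups()] for m in rx.finditer(data)]
print(matches)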

For the given excerpt, this yields:

[['145', 'Total turnover from trade', '5    2    0  0  0', '0  0'], ['155', 'Trading profits', '5  5  6  1', '0  0'], ['165', 'Net trading profits ≠ box 155 minus box 160', '5    5  6  1', '0  0'], ['235', 'P rofits before other deductions and reliefs ≠ net sum of', '5  5  6  1', '0  0'], ['300', 'Profits before qualifying donations and group relief ≠', '5  5    6  1', '0     0'], ['315', 'Profits chargeable to Corporation Tax ≠', '5  5    6  1', '0     0'], ['475', 'Net Corporation Tax liability ≠ box 440 minus box 470', '1  0  5  6', '5  9'], ['510', 'Tax chargeable ≠ total of boxes 475, 480, 500 and 505', '1  0  5  6', '5  9'], ['525', 'Self-assessment of tax payable ≠ box 510 minus box 515', '1  0  5  6', '5  9'], ['600', 'Tax outstanding ≠', '1  0  5  6', '5  9']]
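The digit groups in this output still contain the spaces introduced by the PDF extraction. If a compact value such as the 5561 mentioned in the question is wanted, a post-processing step along these lines might help. This is only a sketch: the two rows are copied from the output above, while the helper name and the dictionary keys are invented for illustration.

```python
import re

# Two rows copied from the output above.
matches = [
    ['155', 'Trading profits', '5  5  6  1', '0  0'],
    ['475', 'Net Corporation Tax liability ≠ box 440 minus box 470', '1  0  5  6', '5  9'],
]


def compact_digits(field):
    """Remove all whitespace inside a captured digit field."""
    return re.sub(r"\s+", "", field)


# Key names are illustrative; the source does not say what the last group means.
cleaned = [
    {
        "box": int(box),
        "title": title,
        "value": compact_digits(value),
        "trailing": compact_digits(rest),
    }
    for box, title, value, rest in matches
]

for row in cleaned:
    print(row)
```

The values are kept as strings so that any leading zeros in the last group survive.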
