Regex fails to exclude matches that contain line breaks

Time: 2020-10-04 19:47:10

Tags: python regex

I'm running the regex shown below; I've split it into variables to make it clearer.


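Basically, I'm trying to identify an 18-digit ID. I don't want to match 18-digit numbers that contain any letters, line breaks, or forward slashes. It is fine if the 18-digit ID is matched with other random symbols in it. I also don't want to match IDs that are preceded by any digits. I also want to include one extra line before the main group to get more context on the match, but what I'm really after is the id_re match (which is why I put a question mark "?" after all_no_numb_newline).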

The pattern is built from the following pieces:

all_no_numb_newline = r'(?:[^\n\d]*\n)' ## I include an extra line just to get more context ##
all_no_numb = r'(?:[^\n\d]*)' ## I do not want there to be any numbers on the same line except the ID ##
x1 = r'(?!(1-888-555|\(888\)))' ## I am excluding a specific common phone number ##
x2 = r'(?![\n\/])\W{0,2}' ## I am excluding line breaks and date formats ##
id_re = f'({x1}\d(?:{x2}\d){{16}}\d)' ## This is an ID number 18 digits long with some symbols in between ##

I then run it with the following call, but it still returns unwanted matches:

re.findall(
    "("+
    all_no_numb_newline+"?"+
    all_no_numb+
    id_re+")"
    , text)[0]

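I expected no line breaks in the ID, and I expected two groups (my overall match and my ID group). Why are there 3 groups instead of 2? And why does "\n", i.e. a line break, appear in the match?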

Edit: example of a match

('L1 (061510)\n1009671-1000', '1 (061510)\n1009671-1000', '')
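
The tuple above can be reproduced in isolation. The sketch below assumes the input is just the fragment visible in the tuple (the real text is not given) and writes id_re as a raw f-string so that \d stays a literal escape; the comments mark where the third group and the line break appear to come from.

```python
import re

all_no_numb_newline = r'(?:[^\n\d]*\n)'
all_no_numb = r'(?:[^\n\d]*)'
x1 = r'(?!(1-888-555|\(888\)))'  # the inner (...) is a capturing group, so it is counted
x2 = r'(?![\n\/])\W{0,2}'        # the lookahead inspects only the next single character
id_re = rf'({x1}\d(?:{x2}\d){{16}}\d)'

pattern = re.compile('(' + all_no_numb_newline + '?' + all_no_numb + id_re + ')')

# Assumed input: only the fragment shown in the example match.
text = 'L1 (061510)\n1009671-1000'

print(pattern.groups)            # number of capturing groups in the pattern
print(pattern.findall(text)[0])  # one tuple entry per capturing group
```

With this input the pattern reports three capturing groups: the outer group, the id_re group, and the alternation inside the x1 lookahead, which findall reports as an empty string. The line break gets in because the separator after "061510" is ")\n": the lookahead in x2 only rejects a separator that starts with "\n" or "/", and after it accepts ")" the \W{0,2} is free to consume the "\n" as its second character.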

Edit: examples of what should not be matched

'Mortgage\nID 756953480812037780'
')\n*DT756953480812037780'
'\nq75695348081 0233 240'
')\n*DT756953480812037780'
'\nq03313375233 0233 329'
'ID 676170114397739293'
'ID NUMBER 676170114397739293'
'ID\n676170114397739293'
'ID676170114397739293'

OUTPUT:

'756953480812037780'
'756953480812037780'
'75695348081 0233 240'
'756953480812037780'
'03313375233 0233 329'
'676170114397739293'
'676170114397739293'
'676170114397739293'
'676170114397739293'

2 answers:

Answer 0: (score: 1)

Use

(?<!\d)\d(?:\s*\d){16}\d(?!\d)

See the proof.

Explanation

--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  \d                       digits (0-9)
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (16 times):
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  ){16}                    end of grouping
--------------------------------------------------------------------------------
  \d                       digits (0-9)
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \d                       digits (0-9)
--------------------------------------------------------------------------------
  )                        end of look-ahead
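
A quick check of this pattern against inputs like the ones in the question might look as follows. This is only a sketch: split_over_lines is a made-up input, and NO_NEWLINE_RE is a hypothetical variant that is not part of the answer, using [^\S\n] (whitespace other than "\n") in place of \s.

```python
import re

# Pattern from the answer: 18 digits with optional whitespace between them.
ANSWER_RE = re.compile(r'(?<!\d)\d(?:\s*\d){16}\d(?!\d)')

# Hypothetical variant (an assumption, not from the answer):
# [^\S\n] matches any whitespace character except "\n".
NO_NEWLINE_RE = re.compile(r'(?<!\d)\d(?:[^\S\n]*\d){16}\d(?!\d)')

spaced = '\nq75695348081 0233 240'         # sample from the question
unwanted = 'L1 (061510)\n1009671-1000'     # unwanted match from the question
split_over_lines = '123456789\n123456789'  # made-up: digits split across two lines

for name, regex in (('answer', ANSWER_RE), ('variant', NO_NEWLINE_RE)):
    for text in (spaced, unwanted, split_over_lines):
        print(name, repr(text), '->', regex.findall(text))
```

Both patterns find '75695348081 0233 240' and reject the unwanted example, because "(" , ")" and "-" are not whitespace. They differ on the made-up input: \s also matches "\n", so the answer's pattern joins the two lines into one 18-digit match, while the variant returns nothing.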

Answer 1: (score: 0)

I don't think Ryszard's answer eliminates \n. I took a hackier approach:

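YY = r'(?!-888-)'                    # negative lookahead for the common phone number
XX = r'[^A-Za-z\d\n\\\/\)\(]{0,2}'   # up to 2 separator characters: no letters, digits, \n, slashes, or parentheses
id_re = rf'({YY}\d(?:{XX}\d|\d{XX}){{16}}\d)'  # 18 digits with optional separators; raw f-string keeps \d intact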

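YY excludes the common phone number shown. XX allows any non-alphanumeric character except \n. No matter how many lookaheads I added, the other approaches always ended up with \n in the match. So I decided to use a simpler but more flexible approach: explicitly excluding all alphanumeric characters and \n (plus other symbols, such as slashes and parentheses, that would cause confusion with dates or phone numbers).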
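
As a rough check, the pattern can be run against a few of the inputs from the question. The sketch below assumes the pattern is compiled from a raw f-string and picks three of the sample inputs plus the unwanted match; the names should_match and unwanted are only for illustration.

```python
import re

YY = r'(?!-888-)'
XX = r'[^A-Za-z\d\n\\\/\)\(]{0,2}'
ID_RE = re.compile(rf'({YY}\d(?:{XX}\d|\d{XX}){{16}}\d)')

# Sample inputs from the question, mapped to the ID expected in the output.
should_match = {
    'Mortgage\nID 756953480812037780': '756953480812037780',
    '\nq75695348081 0233 240': '75695348081 0233 240',
    'ID\n676170114397739293': '676170114397739293',
}
# The match the question wanted to get rid of.
unwanted = 'L1 (061510)\n1009671-1000'

for text, expected in should_match.items():
    print(repr(text), '->', ID_RE.findall(text), 'expected', repr(expected))
print(repr(unwanted), '->', ID_RE.findall(unwanted))
```

Each sample yields its ID, and the unwanted example yields an empty list: the separator " (" after the leading "1" contains "(", which the XX character class excludes, so 18 digits can no longer be chained together across the line break.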

This regex has been quite successful: I get almost 99% of the matches!