Question

When I use the Python regular expression below to perform the task described below, I get the error "unexpected end of pattern".

Regular expression:

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)

Purpose of this regular expression:

INPUT:

CODE876
CODE223
matchjustCODE657
CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665

Should match:

CODE876
CODE223
CODE657
CODE697

and replace the occurrences with:

http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743

Should not match:

code876
testing1CODE888
testing2CODE776
example3CODE654
example2CODE098
http://replaced/CODE665

Final output:

http://productcode/CODE876
http://productcode/CODE223
matchjusthttp://productcode/CODE657
http://productcode/CODE69743
code876
testing1CODE888
example2CODE098
http://replaced/CODE665

Edit and update 1

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)

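The error no longer occurs, but the regex does not match any of the required patterns. Is there a problem with the capturing groups or with the match itself? When I compile this regex, I get no match for my input.

Edit and update 2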

import re

with open("/Users/mymac/Desktop/regex.txt") as f:
    s = f.read()

s1 = re.sub(r'((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)',
            r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', s)
print(s1)

INPUT

CODE123 CODE765 testing1CODE123 example1CODE345 http://www.coding.com/CODE333 CODE345

CODE234

CODE333

OUTPUT

<a href="http://productcode/CODE123">CODE123</a> <a href="http://productcode/CODE765">CODE765</a> testing1<a href="http://productcode/CODE123">CODE123</a> example1<a href="http://productcode/CODE345">CODE345</a> http://www.coding.com/<a href="http://productcode/CODE333">CODE333</a> <a href="http://productcode/CODE345">CODE345</a>

<a href="http://productcode/CODE234">CODE234</a>

<a href="http://productcode/CODE333">CODE333</a>

The regex works on raw input, but not on string input read from a text file.

See inputs 4 and 5 for more results: http://ideone.com/3w1E3

Answer 1

Your main problem is the (?-i), which is wishful thinking as far as Python 2.7 and 3.2 are concerned. See below for details.

import re
# modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)
# (CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)
# observation 1: as presented, pattern has a line break in the middle, just after (?-i)
# ob 2: rather hard to read, should use re.VERBOSE
# ob 3: not obvious whether it's a compile-time or run-time problem
# ob 4: (?i) should be at the very start of the pattern (see docs)
# ob 5: what on earth is (?-i) ... not in 2.7 docs, not in 3.2 docs
pattern = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)'
#### rx = re.compile(pattern)
# above line failed with "sre_constants.error: unexpected end of pattern"
# try without the (?-i)
pattern2 = r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'
rx = re.compile(pattern2)
# This works, now you need to work on observations 1 to 4,
# and rethink your CODE/code strategy

It looks like the suggestions fell on deaf ears... Here is the pattern in re.VERBOSE format:

pattern4 = r'''
    ^
    (?i)
    (
        (?:
            (?!http://)
            (?!testing[0-9])
            (?!example[0-9])
            . #### what is this for?
        )*?
    ) ##### end of capturing group 1
    (CODE[0-9]{3}) #### not in capturing group 1
    (?!</a>)
    '''

Answer 2

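OK, it looks like the problem is the (?-i), which is surprising. The purpose of the inline-modifier syntax is to let you apply modifiers to selected portions of the regex. At least, that's how they work in most flavors. In Python, it seems they always modify the whole regex, the same as the external flags (re.I, re.M, etc.). The alternative (?i:xyz) syntax doesn't work either.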
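For readers on newer interpreters: the behaviour described above is that of Python 2.7 and 3.2. Since Python 3.6 the re module accepts scoped flags of the form (?i:...), while a bare (?-i) in the middle of a pattern is still rejected. The sketch below illustrates both points; the helper names compiles and find_code and the sample pattern are made up for illustration and are not from the question.

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A bare (?-i) in the middle of a pattern is not valid Python regex syntax.
bare_ok = compiles(r'(?i)abc(?-i)DEF')

# Scoped form (assumes Python 3.6+): only the prefix is case-insensitive,
# the product code itself must still be upper case.
scoped = re.compile(r'(?i:match\w*?)(CODE[0-9]{3})')

def find_code(text):
    """Return the product code found after a 'match...' prefix, or None."""
    m = scoped.search(text)
    return m.group(1) if m else None
```

With this pattern, find_code('MATCHjustCODE657') yields the code, while find_code('matchjustcode657') finds nothing because the lower-case "code" sits outside the case-insensitive span.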

As a side note, I don't see any reason to use three separate lookaheads, as you do here:

(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?

Just OR them together:

(?:(?!http://|testing[0-9]|example[0-9]).)*?
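A quick way to convince yourself that the two forms behave the same is to run both against the sample lines from the question. This is only a sketch: the trailing CODE[0-9]{3} is appended here so the fragments can be tested on their own, and the variable names are illustrative.

```python
import re

# Three separate negative lookaheads, as in the question.
separate = re.compile(
    r'(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?CODE[0-9]{3}')
# One lookahead with the alternatives ORed together.
combined = re.compile(
    r'(?:(?!http://|testing[0-9]|example[0-9]).)*?CODE[0-9]{3}')

samples = [
    'CODE876',
    'matchjustCODE657',
    'testing1CODE888',
    'example2CODE098',
    'http://replaced/CODE665',
]

# re.match anchors at the start of each sample, like ^ in the question.
results = [(bool(separate.match(s)), bool(combined.match(s))) for s in samples]
```

Each pair in results holds the same value twice: the first two samples match and the last three do not, under either spelling.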

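EDIT: We seem to have moved from the problem of the regex throwing exceptions to the problem of why it doesn't work. I'm not sure I understand your requirements, but the regex and replacement string below return the results you want.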

s1 = re.sub(r'^((?!http://|testing[0-9]|example[0-9]).*?)(CODE[0-9]{3})(?!</a>)', 
            r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', s)

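See it in action on ideone.com

Is that what you're after?

EDIT: We now know that the replacements are being done within a larger text, not on standalone strings. That makes the problem much more difficult, but we also know that full URLs (the ones that start with http://) only occur in already-existing anchor elements. That means we can split the regex into two alternatives: one to match complete <a>...</a> elements, and one to match our target strings.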

(?s)(?:(<a\s+[^>]*>.*?</a>)|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))

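The trick is to use a function instead of a static string for the replacement. Whenever the regex matches an anchor element, the function will find it in group(1) and return it unchanged. Otherwise, it uses group(2) and group(3) to build a new one.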
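A minimal sketch of that replacement function, using the regex above as-is. The names link_codes and repl and the sample text are assumptions for illustration, not the code from the linked demo.

```python
import re

pattern = re.compile(
    r'(?s)(?:(<a\s+[^>]*>.*?</a>)'
    r'|\b((?:(?!testing[0-9]|example[0-9])\w)*?)(CODE[0-9]{3}))')

def link_codes(text):
    """Wrap bare product codes in anchors, leaving existing anchors alone."""
    def repl(m):
        if m.group(1) is not None:
            # An existing <a>...</a> element: return it unchanged.
            return m.group(1)
        # group(2) is the prefix before the code, group(3) the code itself.
        return '%s<a href="http://productcode/%s">%s</a>' % (
            m.group(2), m.group(3), m.group(3))
    return pattern.sub(repl, text)

sample = ('CODE123 testing1CODE123 '
          '<a href="http://www.coding.com/CODE333">CODE333</a> '
          'matchjustCODE657')
linked = link_codes(sample)
```

Because the anchor alternative comes first and consumes the whole element, the CODE333 inside the existing link is never seen by the second alternative.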

Here's another demo (I know that's horrible code, but I'm too tired right now to learn a more pythonic way.)

Answer 3

The only problem I see is that you replace using the wrong capturing group.

modified=re.sub(r'^(?i)((?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)',r'<a href="http://productcode/\g<1>">\g<1></a>',input)  
                       ^                                                        ^                                                        ^
                    first capturing group                                  second one                                         using the first group

Here I made the first one a non-capturing group as well:

^(?i)(?:(?:(?!http://)(?!testing[0-9])(?!example[0-9]).)*?)(?-i)(CODE[0-9]{3})(?!</a>)

See it here on Regexr
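To see the effect of the group numbering in Python, here is a small sketch. It drops the (?i)/(?-i) modifiers, since Python rejects the latter, and keeps the prefix as capturing group 1; the variable names are illustrative.

```python
import re

# Group 1 is the prefix before the code, group 2 is the code itself.
pattern = r'^((?:(?!http://|testing[0-9]|example[0-9]).)*?)(CODE[0-9]{3})(?!</a>)'

# Referring to group 1 links the prefix and drops the code.
wrong = re.sub(pattern,
               r'<a href="http://productcode/\g<1>">\g<1></a>',
               'matchjustCODE657')

# Referring to group 2 links the code and keeps the prefix in front.
right = re.sub(pattern,
               r'\g<1><a href="http://productcode/\g<2>">\g<2></a>',
               'matchjustCODE657')
```

Here wrong ends up linking "matchjust", while right keeps "matchjust" as plain text and links CODE657.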

Answer 4

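For complex regexes, use the re.X flag to document what you're doing and to make sure the brackets match correctly (i.e. by using indentation to indicate the current nesting level).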
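As a sketch of what that might look like for this question, assuming each input line is processed separately (hence re.M) and leaving out the unsupported (?-i). The function name link_lines and the comments are illustrative.

```python
import re

pattern = re.compile(r'''
    ^                       # start of a line (because of re.M)
    (                       # group 1: whatever precedes the code
        (?:
            (?!http://|testing[0-9]|example[0-9])   # none of these may start here
            .               # then consume one character (not a newline)
        )*?
    )
    (CODE[0-9]{3})          # group 2: the product code, upper case only
    (?!</a>)                # not directly followed by a closing anchor tag
    ''', re.X | re.M)

def link_lines(text):
    """Link the first product code on each line that has an allowed prefix."""
    return pattern.sub(
        r'\g<1><a href="http://productcode/\g<2>">\g<2></a>', text)

lines = ['CODE876', 'CODE223', 'matchjustCODE657', 'CODE69743',
         'code876', 'testing1CODE888', 'example2CODE098',
         'http://replaced/CODE665']
linked = link_lines('\n'.join(lines)).split('\n')
```

On the question's sample input, the first four lines get a link and the last four come back unchanged.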

Unexpected end of pattern: Python Regex
