How do I split a string with a regexp without keeping the capture groups?

Asked: 2017-12-17 07:51:02

Tags: python regex string python-3.x regex-group

I want to split text in Python using a regular expression that contains a backreference.

import re

rexp = re.compile(r"([`]{1,})ABC\1")
rexp.split("blahblah``ABC``blahblah")

I get ['blahblah', '``', 'blahblah'], but I expected ['blahblah', 'blahblah']. How can I split the string without keeping the capture group?

2 answers:

Answer 0 (score: 2)

From the re.split() documentation:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

Since you want to use a backreference you can't avoid the first capturing group, but you can make the rest of them non-capturing and post-process your split to get what you want, e.g.:

rexp = re.compile(r"([`]{1,})->\s*(?:\S+)\s*\|(?:.+?)<-\1")
rexp.split("blahblah``->Left|Right<-``blahblah")[0::2]  # ['blahblah', 'blahblah']

UPDATE: I just noticed that you changed your pattern in the meantime, but the principle is just the same:

rexp = re.compile(r"([`]{1,})ABC\1")  # also, if optimizing, equivalent to: (`+)ABC\1
rexp.split("blahblah``ABC``blahblah")[0::2]  # ['blahblah', 'blahblah']
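The [0::2] slice works because the pattern has exactly one capturing group, so every second list item is captured text. As a rough sketch of how the same idea might extend to patterns with more groups, the stride can be derived from the compiled pattern's groups attribute. The helper name split_without_groups is made up for this illustration and is not part of the re module.

```python
import re

def split_without_groups(pattern, text):
    """Split text on pattern and drop the items contributed by capture groups.

    Hypothetical helper for illustration; not part of the re module.
    """
    rexp = re.compile(pattern)
    # re.split() inserts one list item per capturing group after each piece
    # of text, so the real pieces sit at a stride of (number of groups + 1).
    return rexp.split(text)[0::rexp.groups + 1]

print(split_without_groups(r"(`+)ABC\1", "blahblah``ABC``blahblah"))
# ['blahblah', 'blahblah']
```

With a single group this reduces to the [0::2] slice used above.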

Answer 1 (score: 1)

You can first replace the split pattern with a unique delimiter and then split on that delimiter:

>>> s="blahblah``ABC``blahblah"
>>> delim="<-split->"
>>> re.split(delim, re.sub(r"([`]+)ABC\1", delim, s))
['blahblah', 'blahblah']

The advantage of this approach is that you do not need to make assumptions about where the split pattern occurs in the string.

You can also use Python's faster str.split, since you have turned the regex target into a fixed string:

>>> re.sub(r"([`]+)ABC\1", delim, s).split(delim)
['blahblah', 'blahblah']
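The delimiter approach relies on the delimiter not already appearing in the input. A minimal sketch of one way to guard against that follows. The function name split_via_delimiter and the choice to raise ValueError are assumptions made for this example, not something the answer prescribes.

```python
import re

def split_via_delimiter(pattern, text, delim="<-split->"):
    """Replace every match of pattern with delim, then split on delim.

    Hypothetical helper for illustration only.
    """
    # Assumption: the delimiter must not occur in the input, otherwise
    # the final split would also cut at those unrelated positions.
    if delim in text:
        raise ValueError("delimiter already occurs in the input")
    # A function replacement inserts delim literally, so backslashes in
    # a custom delimiter are not read as replacement-template escapes.
    return re.sub(pattern, lambda m: delim, text).split(delim)

print(split_via_delimiter(r"([`]+)ABC\1", "blahblah``ABC``blahblah"))
# ['blahblah', 'blahblah']
```

If the input may legitimately contain the default delimiter, pass a different one through the delim parameter.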

UPDATE

The timings below show that this is about as fast as the accepted answer:

import re

def f1(s):
    rexp = re.compile(r"([`]{1,})ABC\1")
    return rexp.split(s)[0::2]

def f2(s):
    delim="<-split->"  
    rexp1=re.compile(r"([`]+)ABC\1")  
    rexp2=re.compile(delim)
    return rexp2.split(rexp1.sub(delim, s))

def f3(s):
    delim="<-split->"  
    rexp=re.compile(r"([`]+)ABC\1")  
    return rexp.sub(delim, s).split(delim) 

if __name__=='__main__':
    import timeit    
    for case, x in (('small',1000),('med',10000),('large',1000000)):  
        s="blahblah``ABC``blahblah"*x
        print("Case {}, {:,} x, All equal: {}".format(case,x,(f1(s)==f2(s)==f3(s))))
        for f in (f1,f2,f3):
            print("   {:^10s}{:.4f} secs".format(f.__name__, timeit.timeit("f(s)", setup="from __main__ import f, s", number=10)))

On my older iMac, Python 3.6 prints:

Case small, 1,000 x, All equal: True
       f1    0.0049 secs
       f2    0.0048 secs
       f3    0.0045 secs
Case med, 10,000 x, All equal: True
       f1    0.0512 secs
       f2    0.0536 secs
       f3    0.0526 secs
Case large, 1,000,000 x, All equal: True
       f1    5.2092 secs
       f2    5.6808 secs
       f3    5.5388 secs

With PyPy, the approach I suggested is faster:

Case small, 1,000 x, All equal: True
       f1    0.0020 secs
       f2    0.0021 secs
       f3    0.0012 secs
Case med, 10,000 x, All equal: True
       f1    0.0325 secs
       f2    0.0288 secs
       f3    0.0217 secs
Case large, 1,000,000 x, All equal: True
       f1    4.4900 secs
       f2    3.0680 secs
       f3    2.1079 secs

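So I am not sure what was meant by "for very large input strings, this is a terrible cost...". The timings show that it is about the same or faster, even with huge input strings.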