I want to split text in Python using a regular expression that contains a backreference.
rexp = re.compile(r"([`]{1,})ABC\1")
rexp.split("blahblah``ABC``blahblah")
I get ['blahblah', '``', 'blahblah'], but I expected ['blahblah', 'blahblah'].
How can I split the string without keeping the captured group?
Answer 0 (score: 2)
From the re.split() documentation:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
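To see the documented behavior in isolation, here is a minimal sketch using a made-up separator pattern (runs of hyphens, not the pattern from the question). It compares a capturing group with a non-capturing one:

```python
import re

# With a capturing group, the matched separator text is included in the result.
with_group = re.split(r"(-+)", "a--b-c")

# With a non-capturing group, only the pieces between separators are returned.
without_group = re.split(r"(?:-+)", "a--b-c")

print(with_group)     # ['a', '--', 'b', '-', 'c']
print(without_group)  # ['a', 'b', 'c']
```

A backreference such as \1 needs a capturing group to refer to, which is why the non-capturing form alone is not enough for the pattern in the question.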
Since you want to use a backreference you can't avoid the first capturing group, but you can make the rest of them non-capturing and post-process your split to get what you want, e.g.:
rexp = re.compile(r"([`]{1,})->\s*(?:\S+)\s*\|(?:.+?)<-\1")
rexp.split("blahblah``->Left|Right<-``blahblah")[0::2] # ['blahblah', 'blahblah']
UPDATE: I just noticed that you changed your pattern in the meantime, but the principle is just the same:
rexp = re.compile(r"([`]{1,})ABC\1") # also, if optimizing, equivalent to: (`+)ABC\1
rexp.split("blahblah``ABC``blahblah")[0::2] # ['blahblah', 'blahblah']
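Note that the [0::2] slice relies on the pattern having exactly one capturing group, so that every second list element is the captured text. If you'd rather not depend on that, one possible alternative (my own sketch, not something re provides out of the box) is to walk the matches with finditer and slice the text between them. The helper name split_dropping_groups is hypothetical, and the sketch assumes the pattern never matches the empty string:

```python
import re

def split_dropping_groups(pattern, text):
    """Split text on a compiled pattern, never returning captured groups.

    Assumes the pattern cannot match the empty string.
    """
    pieces = []
    last_end = 0
    for match in pattern.finditer(text):
        # Keep the text between the previous match and this one.
        pieces.append(text[last_end:match.start()])
        last_end = match.end()
    # Keep whatever follows the final match.
    pieces.append(text[last_end:])
    return pieces

rexp = re.compile(r"(`+)ABC\1")
print(split_dropping_groups(rexp, "blahblah``ABC``blahblah"))  # ['blahblah', 'blahblah']
```

This works regardless of how many capturing groups the pattern contains, at the cost of a Python-level loop.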
Answer 1 (score: 1)
You can first replace the split pattern with a unique delimiter and then split on that delimiter:
>>> s="blahblah``ABC``blahblah"
>>> delim="<-split->"
>>> re.split(delim, re.sub(r"([`]+)ABC\1", delim, s))
['blahblah', 'blahblah']
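The advantage of this approach is that you don't need to make any assumptions about where the split pattern falls in the string.
You can also use Python's faster str.split, since you have converted the regex target into a fixed string: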
>>> re.sub(r"([`]+)ABC\1", delim, s).split(delim)
['blahblah', 'blahblah']
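One caveat: this only works if the delimiter never occurs in the input itself. Here is a rough sketch of a guarded version. The helper name split_via_delimiter and the NUL character default are my own assumptions, so pick whatever delimiter suits your data:

```python
import re

def split_via_delimiter(pattern, text, delim="\x00"):
    """Replace each match of pattern with delim, then split with str.split."""
    if delim in text:
        # A delimiter that is already present would produce spurious splits.
        raise ValueError("delimiter already occurs in the input text")
    # A function replacement inserts delim literally, so any backslashes
    # in delim are not interpreted as escapes in the replacement template.
    return pattern.sub(lambda m: delim, text).split(delim)

rexp = re.compile(r"(`+)ABC\1")
print(split_via_delimiter(rexp, "blahblah``ABC``blahblah"))  # ['blahblah', 'blahblah']
```

If you split with re.split instead of str.split, a delimiter containing regex metacharacters would need to go through re.escape first.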
Timings show that this is about as fast as the accepted answer:
import re

def f1(s):
    # Accepted answer: split, then drop the captured groups by slicing.
    rexp = re.compile(r"([`]{1,})ABC\1")
    return rexp.split(s)[0::2]

def f2(s):
    # Substitute a fixed delimiter, then split with a second regex.
    delim = "<-split->"
    rexp1 = re.compile(r"([`]+)ABC\1")
    rexp2 = re.compile(delim)
    return rexp2.split(rexp1.sub(delim, s))

def f3(s):
    # Substitute a fixed delimiter, then split with str.split.
    delim = "<-split->"
    rexp = re.compile(r"([`]+)ABC\1")
    return rexp.sub(delim, s).split(delim)

if __name__ == '__main__':
    import timeit
    for case, x in (('small', 1000), ('med', 10000), ('large', 1000000)):
        s = "blahblah``ABC``blahblah" * x
        print("Case {}, {:,} x, All equal: {}".format(case, x, (f1(s) == f2(s) == f3(s))))
        for f in (f1, f2, f3):
            print(" {:^10s}{:.4f} secs".format(f.__name__, timeit.timeit("f(s)", setup="from __main__ import f, s", number=10)))
On my older iMac, Python 3.6 prints:
Case small, 1,000 x, All equal: True
f1 0.0049 secs
f2 0.0048 secs
f3 0.0045 secs
Case med, 10,000 x, All equal: True
f1 0.0512 secs
f2 0.0536 secs
f3 0.0526 secs
Case large, 1,000,000 x, All equal: True
f1 5.2092 secs
f2 5.6808 secs
f3 5.5388 secs
With PyPy, doing it the way I suggest is faster:
Case small, 1,000 x, All equal: True
f1 0.0020 secs
f2 0.0021 secs
f3 0.0012 secs
Case med, 10,000 x, All equal: True
f1 0.0325 secs
f2 0.0288 secs
f3 0.0217 secs
Case large, 1,000,000 x, All equal: True
f1 4.4900 secs
f2 3.0680 secs
f3 2.1079 secs
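So I'm not sure what was meant by "for a very large input string, this is a horrible cost...". The timings show that even with huge input strings it is about the same or faster.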