How do I capture runs of repeated characters with a regular expression?

Time: 2017-10-01 19:33:15

Tags: python regex

import re
line = "..12345678910111213141516171820212223"
regex = re.compile(r'((?:[a-zA-Z0-9])\1+)')
print ("not coming here")
matches = re.findall(regex,line)
print (matches)

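In the code above, I am trying to capture groups of repeated characters.

So, for example, I need results like: 111, 222, etc.

But when I run the code above, I get this error: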

Traceback (most recent call last):
  File "First.py", line 3, in <module>
    regex = re.compile(r'((?:[a-zA-Z0-9])\1+)')
  File "C:\Users\bhatsubh\AppData\Local\Programs\Python\Python35\lib\re.py", line 224, in compile
    return _compile(pattern, flags)
  File "C:\Users\bhatsubh\AppData\Local\Programs\Python\Python35\lib\re.py", line 293, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\bhatsubh\AppData\Local\Programs\Python\Python35\lib\sre_compile.py", line 536, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\bhatsubh\AppData\Local\Programs\Python\Python35\lib\sre_parse.py", line 829, in parse
    p = _parse_sub(source, pattern, 0)
  File "C:\Users\bhatsubh\AppData\Local\Programs\Python\Python35\lib\sre_parse.py", line 437, in _parse_sub
    itemsappend(_parse(source, state))
  File "C:\Users\bhatsubh\AppData\Local\Programs\Python\Python35\lib\sre_parse.py", line 778, in _parse
    p = _parse_sub(source, state)
  File "C:\Users\bhatsubh\AppData\Local\Programs\Python\Python35\lib\sre_parse.py", line 437, in _parse_sub
    itemsappend(_parse(source, state))
  File "C:\Users\bhatsubh\AppData\Local\Programs\Python\Python35\lib\sre_parse.py", line 524, in _parse
    code = _escape(source, this, state)
  File "C:\Users\bhatsubh\AppData\Local\Programs\Python\Python35\lib\sre_parse.py", line 415, in _escape
    len(escape))
sre_constants.error: cannot refer to an open group at position 16

Could someone please point out where I am going wrong?

3 answers:

Answer 0: (score: 2)

You (probably) want

([a-zA-Z0-9])\1+

a demo on regex101.com

In Python:

import re
line = "..12345678910111213141516171820212223"
regex = re.compile(r'([a-zA-Z0-9])\1+')

matches = [match.group(0) for match in regex.finditer(line)]
print (matches)
# ['111', '222']
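
As a side note on why the list comprehension above goes through finditer and group(0): when a pattern contains exactly one capturing group, re.findall returns the text of that group rather than the whole match. The following is a minimal sketch of the difference; the variable names captured_only and full_runs are only illustrative.

```python
import re

line = "..12345678910111213141516171820212223"
pattern = re.compile(r'([a-zA-Z0-9])\1+')

# With exactly one capturing group, findall returns that group's text,
# i.e. the single character that was repeated, not the whole run.
captured_only = pattern.findall(line)

# finditer yields match objects, so group(0) gives the full repeated run.
full_runs = [m.group(0) for m in pattern.finditer(line)]

print(captured_only)
print(full_runs)
```

Here captured_only holds single characters while full_runs holds the complete runs, which is what the question asks for.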

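Answer 1: (score: 2)

A backreference cannot refer to a group that is still open, that is, a group that encloses the backreference itself. If you just want to print out the repeated characters, you can use a small hack with re.sub: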

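import re

line = "..12345678910111213141516171820212223"

def foo(m):
    # Print the whole match, then replace it with an empty string.
    print(m.group(0))
    return ''

_ = re.sub(r'(\w)\1+', foo, line)  # use [a-zA-Z0-9] if you don't want to match underscores
# 111
# 222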
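
To make the cause of the error concrete, here is a small sketch that compiles the pattern from the question next to a version in which group 1 is closed before \1 appears. The helper try_compile is a hypothetical name introduced only for this illustration; re.compile and re.error are the standard library calls.

```python
import re

bad_pattern = r'((?:[a-zA-Z0-9])\1+)'  # \1 appears while group 1 is still open
good_pattern = r'([a-zA-Z0-9])\1+'     # group 1 is closed before \1 is used

def try_compile(pattern):
    """Return (compiled, None) on success or (None, error) on failure."""
    try:
        return re.compile(pattern), None
    except re.error as exc:
        return None, exc

bad_compiled, bad_error = try_compile(bad_pattern)
good_compiled, good_error = try_compile(good_pattern)

print(bad_error)   # the compile error for the pattern from the question
print(good_error)  # None, since this pattern compiles
```

The first pattern fails at compile time, before any input is examined, which is why the print call after re.compile in the question never runs.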

Answer 2: (score: 1)

It is possible to do this with .findall, but it is simpler to use .finditer, as shown in Jan's answer.

import re

line = "..12345678910111213141516171820212223"
regex = re.compile(r'(([a-zA-Z0-9])\2+)')

matches = [t[0] for t in regex.findall(line)]
print(matches)

Output

['111', '222']

We use \2 because \1 refers to the pattern in the outer parentheses, while \2 refers to the pattern in the inner parentheses.
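
Groups are numbered by the position of their opening parenthesis, so with two capturing groups .findall returns one tuple per match. The sketch below prints those tuples to show what t[0] is selecting; the names pairs, whole_runs and repeated_chars are only illustrative.

```python
import re

line = "..12345678910111213141516171820212223"
pattern = re.compile(r'(([a-zA-Z0-9])\2+)')

# Each tuple is (group 1, group 2): the whole run, then the repeated character.
pairs = pattern.findall(line)
print(pairs)

# Taking index 0 keeps group 1, the outer parentheses.
whole_runs = [outer for outer, inner in pairs]
# Taking index 1 would keep group 2, the single repeated character.
repeated_chars = [inner for outer, inner in pairs]
print(whole_runs)
print(repeated_chars)
```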