Question

For example, I have the string:

 aacbbbqq

As a result, I would like to get the following matches:

 (aa, c, bbb, qq)

I know I could write something like this:

 ([a]+)|([b]+)|([c]+)|...

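But I find that ugly and am looking for a better solution. I am looking for a regular expression solution, not a self-written finite-state machine.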

Answer 1

You can match that with: (\w)\1*

Answer 2

itertools.groupby is not a RegExp, but it's not self-written either. :-) A quote from the Python docs:

# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
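Applied to the string from the question, that could look like the following minimal sketch (the names text and runs are only illustrative). Each group is an iterator over one run of equal characters, so it is joined back into a string:

```python
from itertools import groupby

text = 'aacbbbqq'

# groupby yields (key, group_iterator) pairs, one per run of equal consecutive items
runs = [''.join(group) for _, group in groupby(text)]

print(runs)  # ['aa', 'c', 'bbb', 'qq']
```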

Answer 3

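In general

The trick is to match a single character from the range you want, and then make sure you match all repetitions of that same character: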

>>> import re
>>> matcher = re.compile(r'(.)\1*')

This matches any single character (.) and then its repetitions (\1*), if any.

For your input string, you can get the desired output like this:

>>> [match.group() for match in matcher.finditer('aacbbbqq')]
['aa', 'c', 'bbb', 'qq']

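Note: because of the capturing group, re.findall won't work as expected here; it returns the contents of the group (the first character of each run) instead of the whole match.

Other ranges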
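To illustrate the point about re.findall, here is a small sketch comparing the two calls on the string from the question (the variable names are illustrative):

```python
import re

matcher = re.compile(r'(.)\1*')
text = 'aacbbbqq'

# With one capturing group in the pattern, findall returns that group's contents only
first_chars = matcher.findall(text)

# finditer yields match objects, so group() gives the whole run
runs = [m.group() for m in matcher.finditer(text)]

print(first_chars)  # ['a', 'c', 'b', 'q']
print(runs)         # ['aa', 'c', 'bbb', 'qq']
```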

If you don't want to match just any character, change the . in the regular expression accordingly:

>>> matcher= re.compile(r'([a-z])\1*') # only lower case ASCII letters
>>> matcher= re.compile(r'(?i)([a-z])\1*') # only ASCII letters
>>> matcher= re.compile(r'(\w)\1*') # ASCII letters, digits or underscores (Python 2.x; in Python 3.x \w on str is Unicode-aware by default)
>>> matcher= re.compile(r'(?u)(\w)\1*') # Unicode-aware: any letter or digit known to Unicode, or underscore

Check the latter against u'hello²²' (Python 2.x) or 'hello²²' (Python 3.x):

>>> text= u'hello=\xb2\xb2'
>>> print('\n'.join(match.group() for match in matcher.finditer(text)))
h
e
ll
o
²²

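Matching \w against non-Unicode strings / byte arrays might behave differently if you have first issued a locale.setlocale call (typically together with the re.LOCALE flag).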
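As a side note for Python 3.x only: \w on str patterns is Unicode-aware by default there, and the re.ASCII flag restricts it to ASCII. A short sketch of the difference, assuming Python 3 (the variable names are illustrative):

```python
import re

text = 'hello=\xb2\xb2'  # 'hello=²²'

# Default str pattern in Python 3: \w also matches Unicode letters and digits such as '²'
unicode_runs = [m.group() for m in re.finditer(r'(\w)\1*', text)]

# re.ASCII limits \w to [a-zA-Z0-9_], so the superscript digits are skipped
ascii_runs = [m.group() for m in re.finditer(r'(\w)\1*', text, flags=re.ASCII)]

print(unicode_runs)
print(ascii_runs)
```

In both cases the '=' is skipped, because it is not a word character.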

Answer 4

This will work; see a working example here: http://www.rubular.com/r/ptdPuz0qDV

(\w)\1*

Answer 5

The findall method will work if you capture the back-reference as well, like so:

result = [match[1] + match[0] for match in re.findall(r"(.)(\1*)", string)]
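A self-contained sketch of the same idea, assuming the input is the string from the question (the names text, pairs and runs are illustrative). With two capturing groups, findall returns one tuple per match: the first character and the rest of the run.

```python
import re

text = 'aacbbbqq'

# Each tuple is (first character, remaining repetitions of it)
pairs = re.findall(r'(.)(\1*)', text)

# Concatenate both parts to rebuild each full run
runs = [first + rest for first, rest in pairs]

print(pairs)  # [('a', 'a'), ('c', ''), ('b', 'bb'), ('q', 'q')]
print(runs)   # ['aa', 'c', 'bbb', 'qq']
```

Since both parts consist of the same character, the order of concatenation does not change the result.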

Answer 6

You can use:

re.sub(r"(\w)\1*", r'\1', 'tessst')

The output is:

'test'
