Determine whether a string contains 3 or more repeated consecutive characters in Python

Time: 2015-01-18 04:19:51

Tags: python regex string

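I am going to go through nearly 12 billion string combinations. I am trying to find the fastest, most optimized way to determine whether a given string has 3 (or more) consecutive repeated characters.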

For example:

string = "blah"

The test should return False.

string = "blaaah"

This should return True.

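I have successfully implemented a basic for loop that iterates over the characters of each string and compares each one with the next for a match. This works, but given the number of strings I am filtering, I would really like to optimize it.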
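The loop itself was not posted, so the following is only a minimal sketch of what such a baseline might look like. The function name has_triple_loop is hypothetical and chosen for illustration.

```python
def has_triple_loop(s):
    """Return True if s contains 3 or more identical characters in a row."""
    run_length = 1  # length of the current run of identical characters
    for i in range(1, len(s)):
        if s[i] == s[i - 1]:
            run_length += 1
            if run_length >= 3:
                return True  # stop at the first qualifying run
        else:
            run_length = 1  # the run was broken, start counting again
    return False


print(has_triple_loop("blah"))    # False
print(has_triple_loop("blaaah"))  # True
```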

Any suggestions? Thanks!

3 Answers:

Answer 0 (score: 5)

Use the re module.

>>> import re
>>> def consecutive(string):
...     if re.search(r'(.)\1\1', string):
...         print('True')
...     else:
...         print('False')
...
>>> consecutive('blah')
False
>>> consecutive('blaah')
False
>>> consecutive('blaaah')
True
>>> consecutive('blaaaah')
True

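() is called a capturing group; it captures the characters matched by the pattern inside the group. \1 is a backreference to whatever that group captured. In the string blaaah, (.) captures the first a, and the two \1 backreferences then require two more occurrences of a immediately after it. So aaa is matched.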
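With billions of strings to test, one possible refinement is to compile the pattern once and return a boolean instead of printing. This is a sketch rather than part of the original answer: the names TRIPLE and has_triple are hypothetical, and the re.DOTALL flag is an assumption added so that "." also matches newline characters, which it does not do by default.

```python
import re

# Compiled once and reused for every string.
# Assumption: re.DOTALL is wanted so that a run of newlines also counts.
TRIPLE = re.compile(r'(.)\1\1', re.DOTALL)


def has_triple(s):
    """Return True if s contains the same character 3 or more times in a row."""
    return TRIPLE.search(s) is not None


print(has_triple('blah'))    # False
print(has_triple('blaah'))   # False
print(has_triple('blaaah'))  # True
```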

Answer 1 (score: 3)

You can use itertools.groupby() here. You still have to scan the string, but so does the regular expression:

from itertools import groupby

three_or_more = (char for char, group in groupby(input_string)
                 if sum(1 for _ in group) >= 3)

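This produces a generator; iterate over it to list every character that occurs 3 or more times in a row, or use any() to check whether at least one such group exists: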

if any(three_or_more):
    # found at least one group of consecutive characters that
    # consists of 3 or more.
    pass
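As a self-contained illustration of the listing use case, the sketch below collects the repeated characters into a list instead of a generator, so the result can be inspected more than once. The helper name runs_of_three_or_more is hypothetical.

```python
from itertools import groupby


def runs_of_three_or_more(input_string):
    """Return the characters that occur 3 or more times in a row, in order."""
    return [char for char, group in groupby(input_string)
            if sum(1 for _ in group) >= 3]


print(runs_of_three_or_more("blaaah"))     # ['a']
print(runs_of_three_or_more("aaabbbcc"))   # ['a', 'b']
print(any(runs_of_three_or_more("blah")))  # False
```

A list costs a full pass over the string; the generator form with any() can stop at the first qualifying group.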

Unfortunately, the re solution is more efficient here:

>>> from timeit import timeit
>>> import random
>>> from itertools import groupby
>>> import re
>>> import string
>>> def consecutive_groupby(string):
...     three_or_more = (char for char, group in groupby(string)
...                      if sum(1 for _ in group) >= 3)
...     return any(three_or_more)
... 
>>> def consecutive_re(string):
...     return re.search(r'(.)\1\1', string) is not None
... 
>>> # worst-case: random data with no consecutive strings
...
>>> test_string = ''.join([random.choice(string.ascii_letters) for _ in range(1000)])
>>> consecutive_re(test_string), consecutive_groupby(test_string)
(False, False)
>>> timeit('consecutive(s)', 'from __main__ import test_string as s, consecutive_re as consecutive', number=10000)
0.19730806350708008
>>> timeit('consecutive(s)', 'from __main__ import test_string as s, consecutive_groupby as consecutive', number=10000)
4.633949041366577
>>> # insert repeated characters
...
>>> test_string_with_repeat = test_string[:100] + 'aaa' + test_string[100:]
>>> consecutive_re(test_string_with_repeat), consecutive_groupby(test_string_with_repeat)
(True, True)
>>> timeit('consecutive(s)', 'from __main__ import test_string_with_repeat as s, consecutive_re as consecutive', number=10000)
0.03344106674194336
>>> timeit('consecutive(s)', 'from __main__ import test_string_with_repeat as s, consecutive_groupby as consecutive', number=10000)
0.4827418327331543

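The regular expression approach given by Avinash is the clear winner here: per the timings above, it is roughly 23 times faster on the worst-case input and about 14 times faster when a repeat is present. This shows that you should always measure the alternatives.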
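To repeat the measurement on your own data, a small harness along the following lines could be used. It is a sketch under assumptions: the benchmark helper and the consecutive_loop baseline are hypothetical additions, the pattern is precompiled, and timings will differ by machine and Python version, so no figures are quoted.

```python
import random
import re
import string
from itertools import groupby
from timeit import timeit

PATTERN = re.compile(r'(.)\1\1')


def consecutive_re(s):
    return PATTERN.search(s) is not None


def consecutive_groupby(s):
    return any(sum(1 for _ in group) >= 3 for _, group in groupby(s))


def consecutive_loop(s):
    # Hypothetical plain-loop baseline: track the length of the current run.
    run = 1
    for prev, cur in zip(s, s[1:]):
        run = run + 1 if cur == prev else 1
        if run >= 3:
            return True
    return False


def benchmark(sample, number=1000):
    """Return a dict mapping each function name to its total time in seconds."""
    candidates = (consecutive_re, consecutive_groupby, consecutive_loop)
    return {f.__name__: timeit(lambda: f(sample), number=number)
            for f in candidates}


if __name__ == "__main__":
    sample = ''.join(random.choice(string.ascii_letters) for _ in range(1000))
    for name, seconds in benchmark(sample).items():
        print(f"{name}: {seconds:.4f}s")
```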

Answer 2 (score: 1)

You can define a named capturing group and then search for repetitions of it:

import re

s = 'blaaah'
p = '(?P<g>.)(?P=g){2}'

m = re.search(p, s, re.M)
print(m.group(0))

Result:

aaa
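The same named-group idea can be generalized to runs of any minimum length. The following is a sketch, not part of the original answer: the function find_run and its parameter n are hypothetical, and the open-ended quantifier {n-1,} is an assumption made so that the whole run is returned rather than only its first three characters.

```python
import re


def find_run(s, n=3):
    """Return the first run of n or more identical characters in s, or None."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # (?P<g>.) captures one character; (?P=g){n-1,} requires at least
    # n - 1 further copies of it, so the whole run is matched greedily.
    pattern = r'(?P<g>.)(?P=g){%d,}' % (n - 1)
    m = re.search(pattern, s)
    return m.group(0) if m else None


print(find_run('blaaah'))      # aaa
print(find_run('blaaaah'))     # aaaa
print(find_run('blaah'))       # None
print(find_run('blaah', n=2))  # aa
```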