Replacing adjacent identical tokens that match a regular expression

Date: 2014-11-23 21:56:01

Tags: python regex

In a Python application I need to replace adjacent identical occurrences of whitespace-separated tokens that match a regular expression. For example, with a pattern such as "a\w\w":

"xyz abc abc zzq ak9 ak9 ak9 foo abc" --> "xyz abc*2 zzq ak9*3 foo abc"

Edit

My example above did not make it clear that tokens which do not match the regex should not be aggregated. A better example is:

"xyz xyz abc abc zzq ak9 ak9 ak9 foo foo abc" 
--> "xyz xyz abc*2 zzq ak9*3 foo foo abc"

End edit

I have posted working code below, but it seems more complicated than it should be.

I'm not looking for a round of code golf, but I would be interested in a more readable solution with similar performance that uses the standard Python library.

In my application it is safe to assume that input strings are shorter than 10000 characters and that any given string will contain only a handful, say fewer than 10, distinct strings matching the pattern.

import re

def fm_pattern_factory(ptnstring):
    """
    Return a regex that matches two or more occurrences 
    of ptnstring separated by whitespace.
    >>> fm_pattern_factory('abc').match(' abc abc ') is None
    False
    >>> fm_pattern_factory('abc').match('abc') is None
    True
    """
    ptn = r"\s*({}(?:\s+{})+)\s*".format(ptnstring, ptnstring)
    return re.compile(ptn)

def fm_gather(target, ptnstring):
    """
    Replace adjacent occurrences of ptnstring in target with
    ptnstring*N where N is the number of occurrences.
    >>> fm_gather('xyz abc abc def abc', 'abc')
    'xyz abc*2 def abc'
    >>> fm_gather('xyz abc abc def abc abc abc qrs', 'abc')
    'xyz abc*2 def abc*3 qrs'
    """
    ptn = fm_pattern_factory(ptnstring)
    result = []
    index = 0
    for match in ptn.finditer(target):
        result.append(target[index:match.start()+1])
        repl = "{}*{}".format(ptnstring, match.group(1).count(ptnstring))
        result.append(repl)
        index = match.end() - 1

    result.append(target[index:])
    return "".join(result)

def fm_gather_all(target, ptn):
    """ 
    Apply fm_gather() to all distinct matches for ptn.
    >>> s = "x abc abc y abx abx z acq"
    >>> ptn = re.compile(r"a..")
    >>> fm_gather_all(s, ptn)
    'x abc*2 y abx*2 z acq'
    """
    ptns = set(ptn.findall(target))
    for p in ptns:
        target = fm_gather(target, p)
    return "".join(target)

2 Answers:

Answer 0 (score: 1)

Sorry, I was working on an answer before seeing your first comment. If this doesn't answer your question, let me know and I'll delete it or try to modify it accordingly.

For the simple input provided in the question (stored in the my_string variable in the code below), you can try a different approach: iterate over the split input and keep "buckets" of <matching_word, num_of_occurrences> pairs:

my_string = "xyz abc abc zzq ak9 ak9 ak9 foo abc"
my_splitted_string = my_string.split(' ')
occurrences = []
print("my_splitted_string is a %s now containing: %s"
      % (type(my_splitted_string), my_splitted_string))

current_bucket = [my_splitted_string[0], 1]
occurrences.append(current_bucket)
for i in range(1, len(my_splitted_string)):
    current_word = my_splitted_string[i]
    print("Does %s match %s?" % (current_word, current_bucket[0]))
    if current_word == current_bucket[0]:
        current_bucket[1] += 1
        print("It does. Aggregating")
    else:
        current_bucket = [current_word, 1]
        occurrences.append(current_bucket)
        print("It doesn't. Creating a new 'bucket'")

print("Collected occurrences: %s" % occurrences)
# Now re-collect:
re_collected_str = ""
for occurrence in occurrences:
    if occurrence[1] > 1:
        re_collected_str += "%s*%d " % (occurrence[0], occurrence[1])
    else:
        re_collected_str += "%s " % (occurrence[0])
print("Compressed string: '%s'" % re_collected_str)

(Note the trailing whitespace at the end of the output.)
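The same bucket logic can be wrapped into a reusable function. This is an illustrative sketch, not part of the original answer: the name run_length_buckets is made up, it drops the trailing space, and like the script above it aggregates every repeated token, not just tokens matching the regex.

```python
def run_length_buckets(s):
    """Run-length encode whitespace-separated tokens: 'a a b' -> 'a*2 b'."""
    occurrences = []  # list of [word, count] buckets, in input order
    for word in s.split():
        if occurrences and occurrences[-1][0] == word:
            occurrences[-1][1] += 1   # same word as current bucket: aggregate
        else:
            occurrences.append([word, 1])  # new word: start a new bucket
    return " ".join(w if n == 1 else "{}*{}".format(w, n)
                    for w, n in occurrences)
```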

Answer 1 (score: 0)

The following seems to be stable and performs well in my application. Thanks to BorrajaX for the answer pointing out the benefit of not scanning the input string more than absolutely necessary.

The function below also preserves newlines and whitespace in the output. I forgot to state this in my question, but my application needs to produce some human-readable intermediate output.

import re

def gather_token_sequences(masterptn, target):
    """
    Find all sequences in 'target' of two or more identical adjacent tokens
    that match 'masterptn'.  Count the number of tokens in each sequence.
    Return a new version of 'target' with each sequence replaced by one token
    suffixed with '*N' where N is the count of tokens in the sequence.
    Whitespace in the input is preserved (except where consumed within replaced
    sequences).

    >>> mptn = r'ab\w'
    >>> tgt = 'foo abc abc'
    >>> gather_token_sequences(mptn, tgt)
    'foo abc*2'

    >>> tgt = 'abc abc '
    >>> gather_token_sequences(mptn, tgt)
    'abc*2 '

    >>> tgt = '\\nabc\\nabc abc\\ndef\\nxyz abx\\nabx\\nxxx abc'
    >>> gather_token_sequences(mptn, tgt)
    '\\nabc*3\\ndef\\nxyz abx*2\\nxxx abc'
    """

    # Emulate python's strip() function except that the leading and trailing
    # whitespace are captured for final output. This guarantees that the
    # body of the remaining string will start and end with a token, which
    # slightly simplifies the subsequent matching loops.
    stripped = re.match(r'^(\s*)(\S.*\S)(\s*)$', target, flags=re.DOTALL)
    head, body, tail = stripped.groups()

    # Init the result list and loop variables.
    result = [head]
    i = 0
    token = None
    while i < len(body):
        ## try to match master pattern
        match = re.match(masterptn, body[i:])
        if match is None:
            ## Append char and advance.
            result.append(body[i])
            i += 1

        else:
            ## Start new token sequence
            token = match.group(0)
            esc = re.escape(token) # might have special chars in token
            ptn = r"((?:{}\s+)+{})".format(esc, esc)
            seq = re.match(ptn, body[i:])
            if seq is None: # token is not repeated.
                result.append(token)
                i += len(token)
            else:
                seqstring = seq.group(0)
                replacement = "{}*{}".format(token, seqstring.count(token))
                result.append(replacement)
                i += len(seq.group(0))

    result.append(tail)
    return ''.join(result)