With 'X'

Time: 2017-04-28 06:19:50

Tags: python string replace substring counter

Given a string:

s = 'cdababef'

We count the characters that come before and after each pair of adjacent characters:

from collections import Counter, defaultdict

def per_window(sequence, n=1):
    """
    From http://stackoverflow.com/q/42220614/610569
        >>> list(per_window([1,2,3,4], n=2))
        [(1, 2), (2, 3), (3, 4)]
        >>> list(per_window([1,2,3,4], n=3))
        [(1, 2, 3), (2, 3, 4)]
    """
    start, stop = 0, n
    seq = list(sequence)
    while stop <= len(seq):
        yield tuple(seq[start:stop])
        start += 1
        stop += 1

char_before= defaultdict(Counter)
char_after = defaultdict(Counter) 
for window in per_window(s, 3):
    char_after[window[:2]][window[2]] += 1
    char_before[window[1:]][window[0]] += 1

[OUT]:

>>> char_after
defaultdict(collections.Counter,
            {('a', 'b'): Counter({'a': 1, 'e': 1}),
             ('b', 'a'): Counter({'b': 1}),
             ('b', 'e'): Counter({'f': 1}),
             ('c', 'd'): Counter({'a': 1}),
             ('d', 'a'): Counter({'b': 1})})

>>> char_before
defaultdict(collections.Counter,
            {('a', 'b'): Counter({'b': 1, 'd': 1}),
             ('b', 'a'): Counter({'a': 1}),
             ('b', 'e'): Counter({'a': 1}),
             ('d', 'a'): Counter({'c': 1}),
             ('e', 'f'): Counter({'b': 1})})
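
To make these two tables easier to read, here is a minimal sketch that lists the 3-character windows of s and shows how each window feeds both tables. The helper name windows is an assumption made for this illustration; it is meant to behave like per_window above.

```python
from collections import Counter, defaultdict

def windows(seq, n):
    """Yield every run of n consecutive items of seq as a tuple."""
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])

s = 'cdababef'
char_before = defaultdict(Counter)
char_after = defaultdict(Counter)
for w in windows(s, 3):
    char_after[w[:2]][w[2]] += 1    # first two chars -> the char that follows
    char_before[w[1:]][w[0]] += 1   # last two chars -> the char that precedes

print([''.join(w) for w in windows(s, 3)])
# ['cda', 'dab', 'aba', 'bab', 'abe', 'bef']
print(char_after[('a', 'b')])   # 'ab' is followed once by 'a' and once by 'e'
print(char_before[('a', 'b')])  # 'ab' is preceded once by 'd' and once by 'b'
```

Each window of length 3 contributes exactly one count to each table, so a string of length 8 produces 6 counts per table.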

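If I replace all instances of ab with x, the char_after and char_before counts need to be updated. The goal is to get there without recomputing every substring of s = 'cdxxef', which is what the following does: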

s = 'cdxxef'
char_before2 = defaultdict(Counter)
char_after2 = defaultdict(Counter) 
for window in per_window(s, 3):
    char_after2[window[:2]][window[2]] += 1
    char_before2[window[1:]][window[0]] += 1

[Desired output]:

>>> char_before2
defaultdict(collections.Counter,
            {('d', 'x'): Counter({'c': 1}),
             ('e', 'f'): Counter({'x': 1}),
             ('x', 'e'): Counter({'x': 1}),
             ('x', 'x'): Counter({'d': 1})})

>>> char_after2
defaultdict(collections.Counter,
            {('c', 'd'): Counter({'x': 1}),
             ('d', 'x'): Counter({'x': 1}),
             ('x', 'e'): Counter({'f': 1}),
             ('x', 'x'): Counter({'e': 1})})

How can the counts be updated by recomputing only the substrings affected by the replacement, rather than all substrings?

I tried:

s = 'cdababef'

char_before= defaultdict(Counter)
char_after = defaultdict(Counter) 
for window in per_window(s, 3):
    char_after[window[:2]][window[2]] += 1
    char_before[window[1:]][window[0]] += 1

source, target = ('a', 'b'), 'x'
for ch in char_before[source]:
    count_before = char_before[source][ch]
    char_before[target][ch] += count_before
    char_before[source][ch] = 0

    count_after = char_after[source][ch]
    char_after[target][ch] += count_after
    char_before[source][ch] = 0

But the output does not match the desired char_before2 and char_after2:

>>> char_before
defaultdict(collections.Counter,
            {'x': Counter({'b': 1, 'd': 1}),
             ('b', 'a'): Counter({'a': 1}),
             ('d', 'a'): Counter({'c': 1}),
             ('b', 'e'): Counter({'a': 1}),
             ('a', 'b'): Counter({'b': 0, 'd': 0}),
             ('e', 'f'): Counter({'b': 1})})

>>> char_after
defaultdict(collections.Counter,
            {'x': Counter({'b': 0, 'd': 0}),
             ('b', 'a'): Counter({'b': 1}),
             ('d', 'a'): Counter({'b': 1}),
             ('b', 'e'): Counter({'f': 1}),
             ('a', 'b'): Counter({'a': 1, 'e': 1}),
             ('c', 'd'): Counter({'a': 1})})

3 Answers:

Answer 0 (score: 5)

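Here is one way to solve this in three steps:

  1. Identify the substrings affected by the replacement
  2. Remove the counts for those substrings from the char_before and char_after dictionaries
  3. Perform the replacement and add the new counts to the char_before and char_after dictionaries for only the affected substrings

    First, let's define some variables and run the initial code.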

    source, target = ('a', 'b'), 'x'
    n = 3
    
    char_before= defaultdict(Counter)
    char_after = defaultdict(Counter) 
    for window in per_window(s, n):
        char_after[window[:2]][window[2]] += 1
        char_before[window[1:]][window[0]] += 1
    

    Now we find the spans (start and end indices) of the substrings to be replaced (note that we haven't actually made any replacements yet).

    import re 
    
    spans = [m.span() for m in re.finditer(''.join(source), s)]
    

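    But the windows that fall inside one of these spans are not the only windows whose before/after counts are affected by the replacement. Any window directly before or after one of the spans is affected as well. For example, in s = 'cdababef', if we replace 'ab' with 'x', the initial 'cd' needs its char_after count updated even though no part of 'cd' itself has been replaced.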
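
As an illustration of which existing windows are touched, the sketch below lists every length-n window that overlaps at least one match. The function name affected_windows and the longer sample string are assumptions made for this example; they do not appear in the answer.

```python
import re

def affected_windows(s, source, n):
    """Return the length-n windows of s that overlap any match of source."""
    spans = [m.span() for m in re.finditer(re.escape(source), s)]
    hits = []
    for i in range(len(s) - n + 1):
        # Window [i, i+n) overlaps span [a, b) when i < b and i + n > a.
        if any(i < b and i + n > a for a, b in spans):
            hits.append(s[i:i + n])
    return hits

# Hypothetical longer string, so that some windows are left untouched.
print(affected_windows('zzcdababefzz', 'ab', 3))
# ['cda', 'dab', 'aba', 'bab', 'abe', 'bef']
```

Here 'cda' and 'bef' contain only one character of a match, yet their counts still have to be removed, while 'zzc', 'zcd', 'efz' and 'fzz' can be left alone.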

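    To deal with this, we define a function called merge_spans that not only merges adjacent spans ((2,4) and (4,6) become (2,6)) but also merges spans that lie within extra characters of each other, and pads each resulting span by extra on both sides (where extra is an integer set by a keyword argument). The intuition is that this returns a list of spans in which each span covers all the substrings whose before/after counts are affected by a replacement.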

    def merge_spans(spans, extra = 0):
        extra = max(0,extra)
        merged = spans[:]
        if len(merged) == 1:
            return [(max(merged[0][0]-extra, 0), merged[0][-1]+extra)]
        for i in range(1, len(merged)):
            span = merged[i]
            prev = merged[i-1]
            if prev[-1]+extra >= span[0]-extra:
                merged[i] = (max(0,prev[0]-extra), span[-1]+extra)
                merged[i-1] = ()
            elif i == len(merged)-1:
                merged[i] = (max(0,span[0]-extra), span[-1]+extra)
                merged[i-1] = (max(0,prev[0]-extra), prev[-1]+extra)
            else:
                merged[i-1] = (max(0,prev[0]-extra), prev[-1]+extra)
        return list(filter(None, merged))       
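
For comparison, the same pad-then-merge idea can be sketched more compactly. This is not the answer's function: pad_and_merge is a hypothetical name, and the sketch assumes half-open (start, end) spans as returned by m.span(). Like the code above, it does not clamp the end of a span to the string length, since slicing tolerates that.

```python
def pad_and_merge(spans, extra=0):
    """Pad each (start, end) span by `extra` on both sides, then merge overlaps."""
    extra = max(0, extra)
    merged = []
    for start, end in sorted(spans):
        start, end = max(0, start - extra), end + extra
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous padded span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(pad_and_merge([(2, 4), (4, 6)], extra=2))    # [(0, 8)]
print(pad_and_merge([(2, 4), (12, 14)], extra=2))  # [(0, 6), (10, 16)]
```

With n = 3 and extra = n-1 = 2, the two matches in 'cdababef' collapse into the single region (0, 8), which is the whole string.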
    

    So we create this list of spans. We set extra to n-1 because the n-1 letters on either side of a replacement are affected.

    merged = merge_spans(spans, n-1)   
    

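    Now we can iterate over these spans and remove the counts for the windows affected by the replacement. Then we can perform the replacement within that span and update the counts.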

    for span in merged:
        sub_s = s[span[0]:span[-1]]
        for window in per_window(sub_s, n):
            char_after[window[:2]][window[2]] -= 1
            char_before[window[1:]][window[0]] -= 1
        new_s = sub_s.replace(''.join(source), target)
        for window in per_window(new_s, n):
            char_after[window[:2]][window[2]] += 1
            char_before[window[1:]][window[0]] += 1 
    

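    Note that the above modifies the original char_before and char_after dictionaries; if you need to keep the original counts for some reason, copy them first.

    Finally, we remove every count that is 0 or negative from the counters and drop any window that has no positive counts left. Note that adding Counter() to a counter removes any element whose count is not positive.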

    char_before2 = {k:v+Counter() for k,v in char_before.items() if any(v.values())}
    char_after2 = {k:v+Counter() for k,v in char_after.items() if any(v.values())}
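
The clean-up relies on two standard behaviors of collections.Counter and any(), which this small sketch demonstrates with made-up counts:

```python
from collections import Counter

counts = Counter({'a': 1, 'b': 0, 'c': -1})
cleaned = counts + Counter()   # binary + keeps only strictly positive counts
print(cleaned)                 # Counter({'a': 1})

# any() over the values skips counters whose counts are all zero;
# a negative count is truthy, so any() alone does not filter those out.
print(any(Counter({'b': 0}).values()))   # False
print(any(Counter({'c': -1}).values()))  # True
```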
    

    Result:

    >>> char_before2
    {('d', 'x'): Counter({'c': 1}),
     ('e', 'f'): Counter({'x': 1}),
     ('x', 'e'): Counter({'x': 1}),
     ('x', 'x'): Counter({'d': 1})}
    
    >>> char_after2 
    {('c', 'd'): Counter({'x': 1}),
     ('d', 'x'): Counter({'x': 1}),
     ('x', 'e'): Counter({'f': 1}),
     ('x', 'x'): Counter({'e': 1})}
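
One way to gain confidence in an incremental update like this is to compare it against a full recount of the replaced string. The sketch below is a condensed, self-contained variant of the steps above, not the answer's exact code; the names windows, count_context, positive_only and incremental_recount are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

def windows(seq, n):
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])

def count_context(s, n=3):
    """Full recount: (char_before, char_after) for every length-n window."""
    before, after = defaultdict(Counter), defaultdict(Counter)
    for w in windows(s, n):
        after[w[:-1]][w[-1]] += 1
        before[w[1:]][w[0]] += 1
    return before, after

def positive_only(table):
    """Drop zero/negative counts and keys left with no counts."""
    cleaned = {k: v + Counter() for k, v in table.items()}
    return {k: v for k, v in cleaned.items() if v}

def incremental_recount(s, source, target, n=3):
    before, after = count_context(s, n)
    # Pad each match by n-1 on both sides and merge overlapping regions.
    regions = []
    for m in re.finditer(re.escape(source), s):
        start, end = max(0, m.start() - (n - 1)), min(len(s), m.end() + (n - 1))
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    for start, end in regions:
        old = s[start:end]
        for w in windows(old, n):                           # remove stale windows
            after[w[:-1]][w[-1]] -= 1
            before[w[1:]][w[0]] -= 1
        for w in windows(old.replace(source, target), n):   # add the new ones
            after[w[:-1]][w[-1]] += 1
            before[w[1:]][w[0]] += 1
    return positive_only(before), positive_only(after)

s, source, target = 'cdababef', 'ab', 'x'
expected = tuple(positive_only(t) for t in count_context(s.replace(source, target)))
print(incremental_recount(s, source, target) == expected)
```

The comparison only checks the strings it is given; it is a sanity check rather than a proof, and it assumes a non-empty source.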
    

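Answer 1 (score: 3)

Not really an answer, but too long for a comment:

This seems like a very complicated problem. I'm not sure whether it is really worth doing, or whether it is even possible.

In your proposed code, there are cases you didn't consider. For example, you didn't account for possible double replacements ('ab' occurs twice in s). That is why you get the key 'x' instead of ('x', 'x'). You also didn't consider that a window may contain only half of the replaced sequence, which is why you are missing, for example, the key ('d', 'x').

Another thing: suppose we start with s='cdababaef'. Then we have char_after[('a','b')]['a']=2, and for the replaced string we need char_after[('x','x')]['a']=1.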

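For s='cdabaabaef' we also get char_after[('a','b')]['a']=2, but in the replaced string ('cdxaxaef') both of those 'a' characters survive: each 'x' is still followed by an 'a', so the count to carry over is 2 rather than 1.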
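
The two cases can be checked directly. In this sketch, char_after_counts is a hypothetical helper that rebuilds only the char_after table; the point is that both inputs give the same count for ('a', 'b') followed by 'a', yet a different number of 'a' characters ends up following an 'x' after the replacement.

```python
from collections import Counter, defaultdict

def char_after_counts(s, n=3):
    """Map each (n-1)-character window to a Counter of the characters that follow it."""
    after = defaultdict(Counter)
    for i in range(len(s) - n + 1):
        w = tuple(s[i:i + n])
        after[w[:-1]][w[-1]] += 1
    return after

for s in ('cdababaef', 'cdabaabaef'):
    replaced = s.replace('ab', 'x')
    # Count which characters directly follow an 'x' in the replaced string.
    followers = Counter(replaced[i + 1] for i, ch in enumerate(replaced[:-1]) if ch == 'x')
    print(s, char_after_counts(s)[('a', 'b')]['a'], replaced, followers['a'])
# cdababaef 2 cdxxaef 1
# cdabaabaef 2 cdxaxaef 2
```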

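What I'm trying to say is: how do we know whether the character we counted after source (in our example 'ab') is itself going to be replaced? For that information we would need to consult s in our algorithm (unless char_before and char_after are unique for their input s, but that seems like yet another complicated problem).

In my opinion it would be easier to simply recompute. If you can afford one pass over the original sequence, you can run it again for the replaced sequence. Otherwise this becomes a code-optimization question, which you could ask again on Code Review SE.

But maybe someone else has a clever idea for how to handle this.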

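Answer 2 (score: 2)

It seems to me that the most obvious approach is to scan the string directly, looking for occurrences of the source sequence. As you go, you copy the substrings that don't match the source into a new string. When you find a match, you copy the target sequence (instead of the source sequence) into the new string. Then you scan the sequence around the replacement to determine which substrings have before/after counts affected by the replacement, and update those counts. You save the positions where the target was inserted, and once you have finished replacing, you go back and add the new counts produced by the replacements.

If I understand bunji's answer correctly, this is conceptually the same as what he/she did. It's not pretty, but here is another implementation: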
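
Taken on its own, the scan-and-copy part of this description might look like the following minimal sketch. The function name replace_and_record is an assumption made for this illustration, and the count bookkeeping is deliberately left out.

```python
def replace_and_record(s, source, target):
    """Copy s into a new string, swapping source for target, and record
    the index in the new string where each copy of target starts."""
    if not source:              # nothing to search for; avoid an endless loop
        return s, []
    out, positions, i = [], [], 0
    out_len = 0
    while i < len(s):
        if s.startswith(source, i):
            positions.append(out_len)
            out.append(target)
            out_len += len(target)
            i += len(source)
        else:
            out.append(s[i])
            out_len += 1
            i += 1
    return ''.join(out), positions

print(replace_and_record('cdababef', 'ab', 'x'))   # ('cdxxef', [2, 3])
print(replace_and_record('cdababef', 'a', 'xyz'))  # ('cdxyzbxyzbef', [2, 6])
```

The recorded positions are what a second pass would use to add the counts for the windows around each inserted target.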

from collections import defaultdict
from collections import Counter
import re
from copy import deepcopy

def chars_before_after(s, bin_size):
    def per_window(sequence, n=1):
        """
        From http://stackoverflow.com/q/42220614/610569
            >>> list(per_window([1,2,3,4], n=2))
            [(1, 2), (2, 3), (3, 4)]
            >>> list(per_window([1,2,3,4], n=3))
            [(1, 2, 3), (2, 3, 4)]
        """
        start, stop = 0, n
        seq = list(sequence)
        while stop <= len(seq):
            yield tuple(seq[start:stop])
            start += 1
            stop += 1

    char_before= defaultdict(Counter)
    char_after = defaultdict(Counter)
    for window in per_window(s, bin_size+1):
        char_after[window[:bin_size]][window[-1]] += 1
        char_before[window[1:]][window[0]] += 1

    return char_before, char_after

def replace_chars_recount(s, source, target, char_before, char_after, verbose=False):

    if verbose:
        print('s=' + s + ', source=' + source, 'target=' + target)

        print('char_before')
        for char_counter in char_before.items():
            print(char_counter)

        print('\nchar_after')
        for char_counter in char_after.items():
            print(char_counter)

    char_before = deepcopy(char_before)
    char_after = deepcopy(char_after)
    replaced_s = ''
    source_len = len(source)
    source_start = 0
    source_stop = source_len
    target_pos = []
    target_len = len(target)
    last_replacement = 0

    while source_start < len(s):
        if verbose: print('start_index=' + str(source_start))

        if s[source_start:source_stop] == source:
            replaced_s += target

            before_start = max(source_start-source_len+last_replacement,0)
            before_end = before_start+source_len
            while before_start < source_stop and before_end < len(s):
                before_chars = tuple(s[before_start:before_end])
                if verbose: print('Removing "'+ s[before_end] +'" from after "' + s[before_start:before_end] + '".')
                char_after[before_chars][s[before_end]] -= 1
                before_start += 1
                before_end += 1

            after_end = min(len(s), source_stop+source_len)
            after_start = after_end-source_len
            while after_end > source_start+last_replacement and after_start>0:
                after_chars = tuple(s[after_start:after_end])
                if verbose: print('Removing "' + s[after_start-1] + '" from before "' + s[after_start:after_end] + '".')
                char_before[after_chars][s[after_start-1]] -= 1
                after_start -= 1
                after_end -= 1

            target_pos.append(len(replaced_s) - target_len)
            source_start += source_len
            source_stop += source_len
            last_replacement = source_len

        else:
            replaced_s += s[source_start]
            source_start += 1
            source_stop += 1
            last_replacement = max(0, last_replacement-1)

    last_target = 0-target_len
    for target in target_pos:
        if verbose: print('target_pos=' + str(target))
        before_target_start = max(target-source_len, last_target+target_len, 0)
        before_target_end = before_target_start+source_len
        while before_target_start <= target+target_len-1 and before_target_end < len(replaced_s):
            before_chars = tuple(replaced_s[before_target_start:before_target_end])
            if verbose: print('Adding "' + replaced_s[before_target_end] + '" to after "' + replaced_s[before_target_start:before_target_end] + '".')
            char_after[before_chars][replaced_s[before_target_end]] += 1
            before_target_start += 1
            before_target_end += 1

        after_end = min(len(replaced_s), target + target_len+ source_len)
        after_start = after_end - source_len
        while after_end > max(target, last_target+source_len+target_len) and after_start>0:
            after_chars = tuple(replaced_s[after_start:after_end])
            if verbose: print('Adding "' + replaced_s[after_start - 1] + '" to before "' + replaced_s[after_start:after_end] + '".')
            char_before[after_chars][replaced_s[after_start - 1]] += 1
            after_start -= 1
            after_end -= 1

        last_target=target

    char_before = {k:v+Counter() for k,v in char_before.items() if any(v.values())}
    char_after = {k:v+Counter() for k,v in char_after.items() if any(v.values())}

    if verbose:
        print('replaced_s=' + replaced_s)
        print('char_before')
        for char_counter in char_before.items():
            print(char_counter)

        print('\nchar_after')
        for char_counter in char_after.items():
            print(char_counter)

    return replaced_s, char_before, char_after


def test_replace_chars_recount(s, source, target, verbose=False):
    char_before, char_after = chars_before_after(s, len(source))
    replaced_s, char_before, char_after = replace_chars_recount(s, source, target, char_before, char_after, verbose)
    correct_replaced = re.sub(source, target, s)
    correct_before, correct_after = chars_before_after(replaced_s, len(source))
    correct_answer = correct_replaced==replaced_s and correct_before==char_before and correct_after==char_after

    print('{:>20} {:<20} {:<10} {:<10} {:<5}'.format(s, replaced_s, source, target, str(correct_answer)))

test_cases = [{'s': 'cdababef', 'source': 'ab', 'target': 'x'},
              {'s': 'cdabqabef', 'source': 'ab', 'target': 'x'},
              {'s': 'cdabgabgef', 'source': 'abg', 'target': 'x'},
              {'s': 'cdabgqabgef', 'source': 'abg', 'target': 'x'},
              {'s': 'cdababef', 'source': 'ab', 'target': 'xy'},
              {'s': 'cdababef', 'source': 'ab', 'target': 'xyz'},
              {'s': 'cdababef', 'source': 'a', 'target': 'x'},
              {'s': 'cdababef', 'source': 'a', 'target': 'xyz'},
              {'s': 'ababef', 'source': 'ab', 'target': 'x'},
              {'s': 'cdabab', 'source': 'ab', 'target': 'x'},
              {'s': 'cdababef', 'source': 'xy', 'target': 'x'},
              {'s': 'cdababef', 'source': 'ab', 'target': ''},
              {'s': 'cdabcdabcdef', 'source': 'abcd', 'target': 'x'},
              {'s': 'cdabcdeabcdeabcdeef', 'source': 'abcde', 'target': 'x'},
              {'s': 'cdababef', 'source': 'a', 'target': 'abcd'},
              {'s': 'aaaaa', 'source': 'a', 'target': 'x'},
              {'s': 'aaaaa', 'source': 'a', 'target': 'xy'},
              {'s': '', 'source': '', 'target': ''}]

print('{:>20} {:<20} {:<10} {:<10} {:<5}'.format('Input String', 'Output String', 'Source', 'Target', 'Correct Result?'))

for test_case in test_cases:
    test_replace_chars_recount(test_case['s'], test_case['source'], test_case['target'])

The output is:

    Input String Output String        Source     Target     Correct Result?
            cdababef cdxxef               ab         x          True 
           cdabqabef cdxqxef              ab         x          True 
          cdabgabgef cdxxef               abg        x          True 
         cdabgqabgef cdxqxef              abg        x          True 
            cdababef cdxyxyef             ab         xy         True 
            cdababef cdxyzxyzef           ab         xyz        True 
            cdababef cdxbxbef             a          x          True 
            cdababef cdxyzbxyzbef         a          xyz        True 
              ababef xxef                 ab         x          True 
              cdabab cdxx                 ab         x          True 
            cdababef cdababef             xy         x          True 
            cdababef cdef                 ab                    True 
        cdabcdabcdef cdxxef               abcd       x          True 
 cdabcdeabcdeabcdeef cdxxxef              abcde      x          True 
            cdababef cdabcdbabcdbef       a          abcd       True 
               aaaaa xxxxx                a          x          True 
               aaaaa xyxyxyxyxy           a          xy         True 
                                                                True 

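So this approach works regardless of the source/target lengths. One limitation of the current implementation is that the source length must be the same as the bin size used for the before/after character counts. However, you could change this to make it more flexible.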