Given a string:
s = 'cdababef'
we count which characters come before and after each pair of characters:
from collections import Counter, defaultdict

def per_window(sequence, n=1):
    """
    From http://stackoverflow.com/q/42220614/610569
    >>> list(per_window([1,2,3,4], n=2))
    [(1, 2), (2, 3), (3, 4)]
    >>> list(per_window([1,2,3,4], n=3))
    [(1, 2, 3), (2, 3, 4)]
    """
    start, stop = 0, n
    seq = list(sequence)
    while stop <= len(seq):
        yield tuple(seq[start:stop])
        start += 1
        stop += 1

char_before = defaultdict(Counter)
char_after = defaultdict(Counter)
for window in per_window(s, 3):
    # first two characters of the window -> the character that follows them
    char_after[window[:2]][window[2]] += 1
    # last two characters of the window -> the character that precedes them
    char_before[window[1:]][window[0]] += 1
[OUT]:
>>> char_after
defaultdict(collections.Counter,
            {('a', 'b'): Counter({'a': 1, 'e': 1}),
             ('b', 'a'): Counter({'b': 1}),
             ('b', 'e'): Counter({'f': 1}),
             ('c', 'd'): Counter({'a': 1}),
             ('d', 'a'): Counter({'b': 1})})
>>> char_before
defaultdict(collections.Counter,
            {('a', 'b'): Counter({'b': 1, 'd': 1}),
             ('b', 'a'): Counter({'a': 1}),
             ('b', 'e'): Counter({'a': 1}),
             ('d', 'a'): Counter({'c': 1}),
             ('e', 'f'): Counter({'b': 1})})
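For readers who want to run the counting step on its own, here is a minimal self-contained sketch. The helper name count_context is hypothetical (it is not in the original post), and it slices the string directly instead of going through per_window; the counts it builds should be the same as above for n=3.

```python
from collections import Counter, defaultdict

def count_context(s, n=3):
    """Return (char_before, char_after) built from every length-n window of s.

    count_context is a hypothetical helper name used only for this sketch.
    """
    char_before = defaultdict(Counter)
    char_after = defaultdict(Counter)
    for i in range(len(s) - n + 1):
        window = tuple(s[i:i + n])
        char_after[window[:-1]][window[-1]] += 1   # first n-1 chars -> following char
        char_before[window[1:]][window[0]] += 1    # last n-1 chars -> preceding char
    return char_before, char_after

char_before, char_after = count_context('cdababef')
print(char_after[('a', 'b')])    # characters seen right after 'ab'
print(char_before[('a', 'b')])   # characters seen right before 'ab'
```

Strings shorter than n simply produce empty tables, because the range is empty.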
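If I replace all instances of ab with x, the char_after and char_before counts need to be updated. The goal is to get there without recomputing all the substrings of s = 'cdxxef', as is done here: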
s = 'cdxxef'
char_before2 = defaultdict(Counter)
char_after2 = defaultdict(Counter)
for window in per_window(s, 3):
    char_after2[window[:2]][window[2]] += 1
    char_before2[window[1:]][window[0]] += 1
[DESIRED OUTPUT]:
>>> char_before2
defaultdict(collections.Counter,
            {('d', 'x'): Counter({'c': 1}),
             ('e', 'f'): Counter({'x': 1}),
             ('x', 'e'): Counter({'x': 1}),
             ('x', 'x'): Counter({'d': 1})})
>>> char_after2
defaultdict(collections.Counter,
            {('c', 'd'): Counter({'x': 1}),
             ('d', 'x'): Counter({'x': 1}),
             ('x', 'e'): Counter({'f': 1}),
             ('x', 'x'): Counter({'e': 1})})
How can the counts be updated by recomputing only the substrings affected by the replacement, rather than all of them?
I've tried:
s = 'cdababef'
char_before = defaultdict(Counter)
char_after = defaultdict(Counter)
for window in per_window(s, 3):
    char_after[window[:2]][window[2]] += 1
    char_before[window[1:]][window[0]] += 1

source, target = ('a', 'b'), 'x'
for ch in char_before[source]:
    count_before = char_before[source][ch]
    char_before[target][ch] += count_before
    char_before[source][ch] = 0
    count_after = char_after[source][ch]
    char_after[target][ch] += count_after
    char_before[source][ch] = 0
But the output is not the desired char_before2 and char_after2:
>>> char_before
defaultdict(collections.Counter,
            {'x': Counter({'b': 1, 'd': 1}),
             ('b', 'a'): Counter({'a': 1}),
             ('d', 'a'): Counter({'c': 1}),
             ('b', 'e'): Counter({'a': 1}),
             ('a', 'b'): Counter({'b': 0, 'd': 0}),
             ('e', 'f'): Counter({'b': 1})})
>>> char_after
defaultdict(collections.Counter,
            {'x': Counter({'b': 0, 'd': 0}),
             ('b', 'a'): Counter({'b': 1}),
             ('d', 'a'): Counter({'b': 1}),
             ('b', 'e'): Counter({'f': 1}),
             ('a', 'b'): Counter({'a': 1, 'e': 1}),
             ('c', 'd'): Counter({'a': 1})})
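Answer 0 (score: 5)
Here is one way to solve this problem in three steps:
1. Identify the substrings whose before/after counts will be affected by the replacement
2. Remove those counts from the char_before and char_after dictionaries
3. Make the replacement and add the new counts to the char_before and char_after dictionaries
First, let's define some variables and run the initial code.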
source, target = ('a', 'b'), 'x'
n = 3
char_before = defaultdict(Counter)
char_after = defaultdict(Counter)
for window in per_window(s, n):
    char_after[window[:2]][window[2]] += 1
    char_before[window[1:]][window[0]] += 1
Now we find the spans (start and end indices) of the substring to be replaced (note that we haven't actually made any replacement yet).
import re
spans = [m.span() for m in re.finditer(''.join(source), s)]
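As a quick check of what this step produces for the example string, the sketch below prints the spans. The call to re.escape is an added precaution that is not in the original answer; it only matters if the source contains regex metacharacters such as '.' or '*'.

```python
import re

s = 'cdababef'
source = ('a', 'b')

# re.escape is an assumption added here so the source is matched literally.
pattern = re.escape(''.join(source))
spans = [m.span() for m in re.finditer(pattern, s)]
print(spans)  # [(2, 4), (4, 6)]
```

Each span is a half-open (start, end) pair, so s[2:4] and s[4:6] are the two occurrences of 'ab'.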
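But the windows that fall inside one of these spans are not the only ones whose before/after counts are affected by the replacement. Any window immediately before or after a span is affected as well. For example, in s = 'cdababef', if we replace 'ab' with 'x', the initial 'cd' needs its char_after count updated even though no part of 'cd' itself is replaced.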
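To make the "neighbouring windows are affected too" point concrete, the following sketch lists which length-3 windows share at least one position with a replaced span. The longer string 'qqcdababefqq' and the helper name touches are hypothetical, chosen only so that some windows are left untouched.

```python
import re

s = 'qqcdababefqq'   # hypothetical longer string so that some windows are untouched
source, n = 'ab', 3
spans = [m.span() for m in re.finditer(re.escape(source), s)]

def touches(i, span):
    """True if the window s[i:i+n] shares at least one position with span."""
    return i < span[1] and span[0] < i + n

affected, untouched = [], []
for i in range(len(s) - n + 1):
    bucket = affected if any(touches(i, sp) for sp in spans) else untouched
    bucket.append(s[i:i + n])

print(affected)    # ['cda', 'dab', 'aba', 'bab', 'abe', 'bef']
print(untouched)   # ['qqc', 'qcd', 'efq', 'fqq']
```

The window 'cda' contains only one character of the first 'ab', yet its counts change, which is why the spans have to be widened by n-1 on each side.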
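To handle this, we define a function called merge_spans that not only merges adjacent spans ((2,4) and (4,6) become (2,6)) but also merges spans that lie within extra positions of each other (where extra is an integer set by a keyword argument). The intuition is that this returns a list of spans, each covering all the substrings whose before/after counts are affected by a replacement.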
def merge_spans(spans, extra = 0):
    extra = max(0,extra)
    merged = spans[:]
    if len(merged) == 1:
        return [(max(merged[0][0]-extra, 0), merged[0][-1]+extra)]
    for i in range(1, len(merged)):
        span = merged[i]
        prev = merged[i-1]
        if prev[-1]+extra >= span[0]-extra:
            merged[i] = (max(0,prev[0]-extra), span[-1]+extra)
            merged[i-1] = ()
        elif i == len(merged)-1:
            merged[i] = (max(0,span[0]-extra), span[-1]+extra)
            merged[i-1] = (max(0,prev[0]-extra), prev[-1]+extra)
        else:
            merged[i-1] = (max(0,prev[0]-extra), prev[-1]+extra)
    return list(filter(None, merged))
So we create this list of spans. We set extra to n-1 because the n-1 characters on either side of a replacement are affected.
merged = merge_spans(spans, n-1)
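For comparison, the same idea (widen every span by extra, then merge whatever overlaps) can be written as a single pass. This is an alternative sketch, not the answer's merge_spans: the name pad_and_merge and the length parameter used for clipping at the end of the string are assumptions, and the input spans are assumed to be sorted and non-overlapping, as re.finditer returns them.

```python
def pad_and_merge(spans, extra, length):
    """Pad each (start, end) span by `extra` on both sides, clip to [0, length],
    and merge padded spans that touch or overlap. Assumes spans sorted by start."""
    merged = []
    for start, end in spans:
        start, end = max(0, start - extra), min(length, end + extra)
        if merged and start <= merged[-1][1]:
            # overlaps (or touches) the previous padded span: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(pad_and_merge([(2, 4), (4, 6)], extra=2, length=8))      # [(0, 8)]
print(pad_and_merge([(2, 4), (20, 22)], extra=2, length=30))   # [(0, 6), (18, 24)]
```

Merging matters because two padded spans that overlap would otherwise have the windows in the overlap removed twice.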
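Now we can iterate over these spans and remove the counts for the windows affected by the replacement. Then we make the replacement within that span and update the counts.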
for span in merged:
    sub_s = s[span[0]:span[-1]]
    for window in per_window(sub_s, n):
        char_after[window[:2]][window[2]] -= 1
        char_before[window[1:]][window[0]] -= 1
    new_s = sub_s.replace(''.join(source), target)
    for window in per_window(new_s, n):
        char_after[window[:2]][window[2]] += 1
        char_before[window[1:]][window[0]] += 1
Note that the above modifies the original char_before and char_after dictionaries in place; if you need to keep the original counts for some reason, copy them first.
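The remove-then-add idea can be put together in one self-contained sketch and checked against a full recount. All helper names here (windows, count_context, update_counts, positive) are hypothetical, the span merging is a simplified single-pass variant rather than the answer's merge_spans, and the sketch assumes a non-empty source and n >= 2.

```python
import re
from collections import Counter, defaultdict

def windows(s, n):
    """Yield every length-n window of s as a tuple."""
    return (tuple(s[i:i + n]) for i in range(len(s) - n + 1))

def count_context(s, n):
    before, after = defaultdict(Counter), defaultdict(Counter)
    for w in windows(s, n):
        after[w[:-1]][w[-1]] += 1
        before[w[1:]][w[0]] += 1
    return before, after

def update_counts(s, source, target, n, before, after):
    """Adjust before/after in place for replacing source with target in s."""
    extra = n - 1
    regions = []
    for m in re.finditer(re.escape(source), s):
        start, end = max(0, m.start() - extra), min(len(s), m.end() + extra)
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], end)      # merge with previous region
        else:
            regions.append((start, end))
    for start, end in regions:
        old = s[start:end]
        new = old.replace(source, target)
        # subtract the windows of the old text, then add those of the new text
        for sign, text in ((-1, old), (+1, new)):
            for w in windows(text, n):
                after[w[:-1]][w[-1]] += sign
                before[w[1:]][w[0]] += sign

def positive(table):
    """Drop zero/negative counts and keys left with no counts."""
    return {k: +v for k, v in table.items() if +v}

s, source, target, n = 'cdababef', 'ab', 'x', 3
before, after = count_context(s, n)
update_counts(s, source, target, n, before, after)
expected_before, expected_after = count_context(s.replace(source, target), n)
print(positive(before) == positive(expected_before))   # True
print(positive(after) == positive(expected_after))     # True
```

Only the windows inside the widened regions are touched, so on a long string with few matches the update does far less work than a full recount.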
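Finally, we remove all counts that are 0 or negative from the counters, and drop entirely any window that has no positive counts. Note that adding Counter() to a counter removes every element whose value is not positive.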
char_before2 = {k:v+Counter() for k,v in char_before.items() if any(v.values())}
char_after2 = {k:v+Counter() for k,v in char_after.items() if any(v.values())}
Result:
>>> char_before2
{('d', 'x'): Counter({'c': 1}),
 ('e', 'f'): Counter({'x': 1}),
 ('x', 'e'): Counter({'x': 1}),
 ('x', 'x'): Counter({'d': 1})}
>>> char_after2
{('c', 'd'): Counter({'x': 1}),
 ('d', 'x'): Counter({'x': 1}),
 ('x', 'e'): Counter({'f': 1}),
 ('x', 'x'): Counter({'e': 1})}
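The clean-up step relies on a documented property of Counter arithmetic: the result of an addition keeps only positive counts. A small sketch with made-up values shows the effect:

```python
from collections import Counter

c = Counter({'a': 2, 'b': 0, 'c': -1})
cleaned = c + Counter()
print(cleaned)        # Counter({'a': 2})
print(+c)             # Counter({'a': 2}), unary plus has the same effect

# A counter holding only negative values still passes any(v.values()),
# but becomes empty once Counter() is added to it.
only_negative = Counter({'z': -1})
print(any(only_negative.values()), only_negative + Counter())   # True Counter()
```

If negative counts can occur in your data, filtering on the cleaned counter (for example `if v + Counter()`) avoids leaving keys that map to an empty Counter.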
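Answer 1 (score: 3)
Not really an answer, but this is too long for a comment:
This seems like a very complicated problem. I'm not sure whether it is really worth doing, or whether it is even possible.
In the code you proposed, you don't account for some cases. For example, you don't account for possible double replacements ('ab' occurs twice in s). That's why you get the key 'x' rather than ('x', 'x'). Also, you don't account for windows that overlap the replaced sequence only halfway, which is why you're missing keys such as ('d', 'x').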
Another thing: suppose we start from s='cdababaef'. Then we have char_after[('a','b')]['a']=2, and for the replaced string we need char_after[('x','x')]['a']=1.
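For s='cdabaabaef' we also get char_after[('a','b')]['a']=2, but in the replaced string 'cdxaxaef' neither of those 'a' characters is absorbed into a replacement, so 'a' follows an 'x' twice there, versus once in the previous case.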
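What I'm trying to say is: how do we know whether the character we counted after source ('ab' in our example) is itself going to be replaced? For that information we need to consult s in our algorithm (unless char_before and char_after are unique to their input s, but that seems like another complicated problem).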
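The ambiguity can be seen by running both strings through the counting step. In this sketch the helper names char_after_counts and a_after_x are hypothetical; the point is that the same pre-replacement count leads to different post-replacement counts, so the tables alone cannot decide the outcome.

```python
from collections import Counter, defaultdict

def char_after_counts(s, n=3):
    """Map each (n-1)-character prefix to a Counter of the character that follows."""
    counts = defaultdict(Counter)
    for i in range(len(s) - n + 1):
        counts[tuple(s[i:i + n - 1])][s[i + n - 1]] += 1
    return counts

def a_after_x(s, n=3):
    """Count windows in which 'a' directly follows an 'x'."""
    return sum(c['a'] for key, c in char_after_counts(s, n).items() if key[-1] == 'x')

for s in ('cdababaef', 'cdabaabaef'):
    replaced = s.replace('ab', 'x')
    print(s, char_after_counts(s)[('a', 'b')]['a'], replaced, a_after_x(replaced))
# cdababaef 2 cdxxaef 1
# cdabaabaef 2 cdxaxaef 2
```

In the first string one of the two 'a' characters after 'ab' belongs to the next 'ab' and disappears; in the second string both survive.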
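In my opinion, it would be easier to simply recompute. If you can afford the run over the original sequence, you can afford to run it again over the replaced sequence. Otherwise this becomes a code optimization problem, which you could ask again on Code Review SE.
But maybe someone else has a clever idea for how to handle this.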
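The "simply recompute" route might look like the sketch below: do the replacement on the whole string, then rebuild both tables in one linear pass. The function name recount_after_replace is hypothetical.

```python
from collections import Counter, defaultdict

def recount_after_replace(s, source, target, n=3):
    """Replace source with target in s, then rebuild both tables from scratch."""
    replaced = s.replace(source, target)
    char_before, char_after = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(replaced) - n + 1):
        window = tuple(replaced[i:i + n])
        char_after[window[:-1]][window[-1]] += 1
        char_before[window[1:]][window[0]] += 1
    return replaced, char_before, char_after

replaced, char_before2, char_after2 = recount_after_replace('cdababef', 'ab', 'x')
print(replaced)   # cdxxef
print(dict(char_after2))
```

The cost is proportional to the length of the replaced string, so it is a reasonable baseline to measure any incremental scheme against.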
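Answer 2 (score: 2)
To me, the most obvious approach is to scan the string directly, looking for occurrences of the source sequence. As you go, you copy the substrings that don't match the source into a new string. When you find a match for the source, you copy the target sequence (instead of the source sequence) into the new string. You then scan the sequence around the replacement to determine which substrings have before/after counts affected by the replacement, and update those counts. You save the positions where the target was inserted and, once all replacements are done, go back and add the new counts that the replacements produce.
If I understand bunji's answer correctly, this is conceptually the same as what he/she did. It's not pretty, but here is another implementation: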
from collections import defaultdict
from collections import Counter
import re
from copy import deepcopy


def chars_before_after(s, bin_size):
    def per_window(sequence, n=1):
        """
        From http://stackoverflow.com/q/42220614/610569
        >>> list(per_window([1,2,3,4], n=2))
        [(1, 2), (2, 3), (3, 4)]
        >>> list(per_window([1,2,3,4], n=3))
        [(1, 2, 3), (2, 3, 4)]
        """
        start, stop = 0, n
        seq = list(sequence)
        while stop <= len(seq):
            yield tuple(seq[start:stop])
            start += 1
            stop += 1

    char_before = defaultdict(Counter)
    char_after = defaultdict(Counter)
    for window in per_window(s, bin_size+1):
        char_after[window[:bin_size]][window[-1]] += 1
        char_before[window[1:]][window[0]] += 1
    return char_before, char_after


def replace_chars_recount(s, source, target, char_before, char_after, verbose=False):
    if verbose:
        print('s=' + s + ', source=' + source, 'target=' + target)
        print('char_before')
        for char_counter in char_before.items():
            print(char_counter)
        print('\nchar_after')
        for char_counter in char_after.items():
            print(char_counter)

    # work on copies so the caller's tables are left untouched
    char_before = deepcopy(char_before)
    char_after = deepcopy(char_after)
    replaced_s = ''
    source_len = len(source)
    source_start = 0
    source_stop = source_len
    target_pos = []
    target_len = len(target)
    last_replacement = 0

    # first pass: build the replaced string and remove the affected counts
    while source_start < len(s):
        if verbose: print('start_index=' + str(source_start))
        if s[source_start:source_stop] == source:
            replaced_s += target
            before_start = max(source_start-source_len+last_replacement,0)
            before_end = before_start+source_len
            while before_start < source_stop and before_end < len(s):
                before_chars = tuple(s[before_start:before_end])
                if verbose: print('Removing "'+ s[before_end] +'" from after "' + s[before_start:before_end] + '".')
                char_after[before_chars][s[before_end]] -= 1
                before_start += 1
                before_end += 1
            after_end = min(len(s), source_stop+source_len)
            after_start = after_end-source_len
            while after_end > source_start+last_replacement and after_start>0:
                after_chars = tuple(s[after_start:after_end])
                if verbose: print('Removing "' + s[after_start-1] + '" from before "' + s[after_start:after_end] + '".')
                char_before[after_chars][s[after_start-1]] -= 1
                after_start -= 1
                after_end -= 1
            target_pos.append(len(replaced_s) - target_len)
            source_start += source_len
            source_stop += source_len
            last_replacement = source_len
        else:
            replaced_s += s[source_start]
            source_start += 1
            source_stop += 1
            last_replacement = max(0, last_replacement-1)

    # second pass: add the counts produced around each inserted target
    last_target = 0-target_len
    for target in target_pos:
        if verbose: print('target_pos=' + str(target))
        before_target_start = max(target-source_len, last_target+target_len, 0)
        before_target_end = before_target_start+source_len
        while before_target_start <= target+target_len-1 and before_target_end < len(replaced_s):
            before_chars = tuple(replaced_s[before_target_start:before_target_end])
            if verbose: print('Adding "' + replaced_s[before_target_end] + '" to after "' + replaced_s[before_target_start:before_target_end] + '".')
            char_after[before_chars][replaced_s[before_target_end]] += 1
            before_target_start += 1
            before_target_end += 1
        after_end = min(len(replaced_s), target + target_len+ source_len)
        after_start = after_end - source_len
        while after_end > max(target, last_target+source_len+target_len) and after_start>0:
            after_chars = tuple(replaced_s[after_start:after_end])
            if verbose: print('Adding "' + replaced_s[after_start - 1] + '" to before "' + replaced_s[after_start:after_end] + '".')
            char_before[after_chars][replaced_s[after_start - 1]] += 1
            after_start -= 1
            after_end -= 1
        last_target=target

    # drop zero/negative counts and keys with nothing left
    char_before = {k:v+Counter() for k,v in char_before.items() if any(v.values())}
    char_after = {k:v+Counter() for k,v in char_after.items() if any(v.values())}
    if verbose:
        print('replaced_s=' + replaced_s)
        print('char_before')
        for char_counter in char_before.items():
            print(char_counter)
        print('\nchar_after')
        for char_counter in char_after.items():
            print(char_counter)
    return replaced_s, char_before, char_after


def test_replace_chars_recount(s, source, target, verbose=False):
    char_before, char_after = chars_before_after(s, len(source))
    replaced_s, char_before, char_after = replace_chars_recount(s, source, target, char_before, char_after, verbose)
    # compare against a plain replacement followed by a full recount
    correct_replaced = re.sub(source, target, s)
    correct_before, correct_after = chars_before_after(replaced_s, len(source))
    correct_answer = correct_replaced==replaced_s and correct_before==char_before and correct_after==char_after
    print('{:>20} {:<20} {:<10} {:<10} {:<5}'.format(s, replaced_s, source, target, str(correct_answer)))


test_cases = [{'s': 'cdababef', 'source': 'ab', 'target': 'x'},
              {'s': 'cdabqabef', 'source': 'ab', 'target': 'x'},
              {'s': 'cdabgabgef', 'source': 'abg', 'target': 'x'},
              {'s': 'cdabgqabgef', 'source': 'abg', 'target': 'x'},
              {'s': 'cdababef', 'source': 'ab', 'target': 'xy'},
              {'s': 'cdababef', 'source': 'ab', 'target': 'xyz'},
              {'s': 'cdababef', 'source': 'a', 'target': 'x'},
              {'s': 'cdababef', 'source': 'a', 'target': 'xyz'},
              {'s': 'ababef', 'source': 'ab', 'target': 'x'},
              {'s': 'cdabab', 'source': 'ab', 'target': 'x'},
              {'s': 'cdababef', 'source': 'xy', 'target': 'x'},
              {'s': 'cdababef', 'source': 'ab', 'target': ''},
              {'s': 'cdabcdabcdef', 'source': 'abcd', 'target': 'x'},
              {'s': 'cdabcdeabcdeabcdeef', 'source': 'abcde', 'target': 'x'},
              {'s': 'cdababef', 'source': 'a', 'target': 'abcd'},
              {'s': 'aaaaa', 'source': 'a', 'target': 'x'},
              {'s': 'aaaaa', 'source': 'a', 'target': 'xy'},
              {'s': '', 'source': '', 'target': ''}]

print('{:>20} {:<20} {:<10} {:<10} {:<5}'.format('Input String', 'Output String', 'Source', 'Target', 'Correct Result?'))
for test_case in test_cases:
    test_replace_chars_recount(test_case['s'], test_case['source'], test_case['target'])
The output is:
        Input String Output String        Source     Target     Correct Result?
            cdababef cdxxef               ab         x          True
           cdabqabef cdxqxef              ab         x          True
          cdabgabgef cdxxef               abg        x          True
         cdabgqabgef cdxqxef              abg        x          True
            cdababef cdxyxyef             ab         xy         True
            cdababef cdxyzxyzef           ab         xyz        True
            cdababef cdxbxbef             a          x          True
            cdababef cdxyzbxyzbef         a          xyz        True
              ababef xxef                 ab         x          True
              cdabab cdxx                 ab         x          True
            cdababef cdababef             xy         x          True
            cdababef cdef                 ab                    True
        cdabcdabcdef cdxxef               abcd       x          True
 cdabcdeabcdeabcdeef cdxxxef              abcde      x          True
            cdababef cdabcdbabcdbef       a          abcd       True
               aaaaa xxxxx                a          x          True
               aaaaa xyxyxyxyxy           a          xy         True
                                                                True
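So this approach works regardless of the source/target lengths. One limitation of the current implementation is that the source length must be the same as the bin size used for the before/after character counts. However, you could change this to make it more flexible.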