Finding the shortest substring containing certain characters in linear time

Time: 2015-07-24 10:26:02

Tags: python string algorithm

  

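Goal: implement an algorithm that, given strings a and b, returns the shortest substring of a that contains all the characters of b. String b may contain duplicates, and each occurrence counts.

The algorithm is basically this one: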
http://www.geeksforgeeks.org/find-the-smallest-window-in-a-string-containing-all-characters-of-another-string/

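In the linked article, the algorithm only finds the length of the shortest substring, but returning the substring itself is a minor change.

Here is my implementation: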

import collections

def issubset(c1, c2):
    '''Return True if c1 is a subset of c2, False otherwise.'''
    return not c1 - (c1 & c2)


def min_idx(seq, target):
    '''Least index of seq such that seq[idx] is contained in target.'''
    for idx, elem in enumerate(seq):
        if elem in target:
            return idx


def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        current_hist[t] += 1
    minlen = len(current)
    shortest = current
    for t in i:
        current.append(t)
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            idx = min_idx(current[1:], target_hist) + 1
            current = current[idx:]
            if len(current) < minlen:
                minlen = len(current)
                shortest = current
    return current

Unfortunately, it doesn't work. For example,

>>> minsub('this is a test string', 'tist')
['s', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', 's', 't', 'r', 'i', 'n', 'g']
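
For comparison, the expected result for this input is the 6-character window 't stri'. A brute-force sketch of the specification makes that easy to check. It is nowhere near linear time, minsub_bruteforce is a hypothetical helper name that is not part of the implementation above, and it returns a string rather than a list of characters.

```python
import collections


def minsub_bruteforce(a, b):
    '''Shortest substring of a containing every character of b (with multiplicity).

    Hypothetical reference helper: tries every window, shortest first.
    Returns '' when no window qualifies.
    '''
    need = collections.Counter(b)
    for length in range(len(b), len(a) + 1):
        for start in range(len(a) - length + 1):
            window = a[start:start + length]
            # Counter subtraction keeps only positive counts, so an empty
            # result means the window covers everything in need.
            if not need - collections.Counter(window):
                return window
    return ''


print(minsub_bruteforce('this is a test string', 'tist'))  # t stri
```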

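What am I missing?
Side note: I'm not sure whether my implementation is O(n), but that is a different question. For now, I'm just looking to fix my implementation.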

Edit: a solution that seems to work:

import collections


def issubset(c1, c2):
    '''Return True if c1 is a subset of c2, False otherwise.'''
    return not c1 - (c1 & c2)


def min_idx(seq, target):
    '''Least index of seq such that seq[idx] is contained in target.'''
    for idx, elem in enumerate(seq):
        if elem in target:
            return idx


def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        current_hist[t] += 1
    minlen = len(current)
    shortest = current[:]
    for t in i:
        current.append(t)
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            current_hist = collections.Counter(current)
            for idx, elem in enumerate(current[1:], 1):
                if not current_hist[elem] - target_hist[elem]:
                    break
                current_hist[elem] -= 1
            current = current[idx:]
            if len(current) < minlen:
                minlen = len(current)
                shortest = current[:]
    return shortest

1 answer:

Answer 0 (score: 1)

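The problem is in this step, when we add a character to current and it matches the first character:

    Remove the leftmost character and all other extra characters after the leftmost character.

The value of idx computed by

            idx = min_idx(current[1:], target_hist) + 1

is sometimes lower than expected: idx should keep increasing as long as target_hist is a subset of current_hist. So we need to keep current_hist up to date in order to compute the correct value for idx. Also, minsub should return shortest instead of current.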

def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        if t in target_hist:
            current_hist[t] += 1
    minlen = len(current)
    shortest = current[:]  # copy, so later appends to current do not alter it
    #current = []
    for t in i:
        current.append(t)
        current_hist[t] += 1
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            #idx = min_idx(current[1:], target_hist) + 1
            idx = 0
            while issubset(target_hist, current_hist):
                u = current[idx]
                current_hist[u] -= 1
                idx += 1
            idx -= 1
            u = current[idx]
            current_hist[u] += 1
            current = current[idx:]
        if len(current) < minlen:
            minlen = len(current)
            shortest = current[:]
    return shortest
In [9]: minsub('this is a test string', 'tist')
Out[9]: ['t', ' ', 's', 't', 'r', 'i']
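
As a cross-check, here is a minimal sketch of the usual two-pointer sliding window. It is a different technique from the code above rather than a cleanup of it. The name minsub_window is hypothetical, and it returns a string ('' when no window exists) instead of a list. Each index of a enters and leaves the window at most once, so the running time is O(len(a) + len(b)).

```python
import collections


def minsub_window(a, b):
    '''Shortest substring of a containing all characters of b, or '' if none.

    Hypothetical name; a two-pointer sketch, not the code from the answer.
    '''
    if not b:
        return ''
    need = collections.Counter(b)   # how many of each character are still owed
    missing = len(b)                # total number of characters still owed
    best = None                     # (start, end) of the best window so far
    left = 0
    for right, ch in enumerate(a):
        if need[ch] > 0:
            missing -= 1
        need[ch] -= 1               # negative means a surplus inside the window
        if missing:
            continue
        # a[left:right + 1] is valid: drop surplus characters on the left.
        while need[a[left]] < 0:
            need[a[left]] += 1
            left += 1
        if best is None or right - left < best[1] - best[0]:
            best = (left, right)
        # Give up the leftmost needed character so the search can continue.
        need[a[left]] += 1
        missing += 1
        left += 1
    return '' if best is None else a[best[0]:best[1] + 1]


print(minsub_window('this is a test string', 'tist'))  # t stri
```

This sketch also shrinks a window whose first character is already in surplus, so minsub_window('aab', 'ab') gives 'ab', and it returns '' when a cannot cover b.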