Question

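Goal: implement an algorithm that, given strings a and b, returns the shortest substring of a that contains all characters of b. The string b may contain duplicates.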

The algorithm is basically this one:
http://www.geeksforgeeks.org/find-the-smallest-window-in-a-string-containing-all-characters-of-another-string/

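In the linked article the algorithm only finds the length of the shortest substring, but returning the substring itself is a minor change.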
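For reference, the change from returning a length to returning the substring can be sketched as below. This is a minimal sketch of the usual sliding-window technique, not the linked article's code; the name `min_window` and the choice to return `''` when no window exists are assumptions made for illustration.

```python
import collections


def min_window(a, b):
    """Shortest substring of a containing every character of b
    (duplicates included), or '' if there is none."""
    if not b:
        return ''
    need = collections.Counter(b)  # required count per character; negative means surplus
    missing = len(b)               # required characters not yet inside the window
    best = None                    # (left, right) bounds of the best window so far
    left = 0
    for right, ch in enumerate(a):
        if need[ch] > 0:
            missing -= 1
        need[ch] -= 1
        if missing == 0:
            # Drop surplus characters from the front of the window.
            while need[a[left]] < 0:
                need[a[left]] += 1
                left += 1
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            # Release the leftmost required character and keep scanning.
            need[a[left]] += 1
            missing += 1
            left += 1
    return '' if best is None else a[best[0]:best[1] + 1]


print(min_window('this is a test string', 'tist'))  # t stri
```

Each index enters and leaves the window at most once, so the scan does a linear number of counter updates.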

Here is my implementation:

import collections

def issubset(c1, c2):
    '''Return True if c1 is a subset of c2, False otherwise.'''
    return not c1 - (c1 & c2)


def min_idx(seq, target):
    '''Least index of seq such that seq[idx] is contained in target.'''
    for idx, elem in enumerate(seq):
        if elem in target:
            return idx


def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        current_hist[t] += 1
    minlen = len(current)
    shortest = current
    for t in i:
        current.append(t)
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            idx = min_idx(current[1:], target_hist) + 1
            current = current[idx:]
            if len(current) < minlen:
                minlen = len(current)
                shortest = current
    return current

Unfortunately, it does not work. For example,

>>> minsub('this is a test string', 'tist')
['s', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', 's', 't', 'r', 'i', 'n', 'g']

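What am I missing?
Side note: I am not sure my implementation is O(n), but that is a separate question. For now, I am only looking to fix my implementation.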
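A brute-force reference makes failures like this one easy to pin down on small inputs. The following is a minimal sketch, not part of the implementation above; `brute_minsub` is a hypothetical name, and it returns a list of characters only to match what `minsub` returns.

```python
import collections


def brute_minsub(a, b):
    """Reference answer: try every substring of a, shortest first."""
    target = collections.Counter(b)
    for length in range(len(b), len(a) + 1):
        for start in range(len(a) - length + 1):
            window = a[start:start + length]
            # Counter subtraction keeps only positive counts, so an empty
            # result means the window covers every required character.
            if not target - collections.Counter(window):
                return list(window)
    return []


print(brute_minsub('this is a test string', 'tist'))
# ['t', ' ', 's', 't', 'r', 'i']
```

It is far slower than a linear scan, so it is only useful as a correctness check.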

EDIT: a solution that seems to work:

import collections


def issubset(c1, c2):
    '''Return True if c1 is a subset of c2, False otherwise.'''
    return not c1 - (c1 & c2)


def min_idx(seq, target):
    '''Least index of seq such that seq[idx] is contained in target.'''
    for idx, elem in enumerate(seq):
        if elem in target:
            return idx


def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        current_hist[t] += 1
    minlen = len(current)
    shortest = current[:]
    for t in i:
        current.append(t)
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            current_hist = collections.Counter(current)
            for idx, elem in enumerate(current[1:], 1):
                if not current_hist[elem] - target_hist[elem]:
                    break
                current_hist[elem] -= 1
            current = current[idx:]
            if len(current) < minlen:
                minlen = len(current)
                shortest = current[:]
    return shortest

Answer 1

The problem is in this step, when we add a character to current and it matches the first character:

Remove the leftmost character and all other extra characters after the leftmost character.

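The value of idx computed by

            idx = min_idx(current[1:], target_hist) + 1

is sometimes lower than expected: idx should keep increasing for as long as target_hist remains a subset of current_hist. So we need to keep current_hist up to date in order to compute the correct value of idx. Also, minsub should return shortest rather than current.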
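To make the difference concrete, here is a small sketch that compares the two ways of choosing idx on the window the question's code holds when the final 't' of 'test' is appended. `trim_idx` is a hypothetical helper written only for this comparison; `min_idx` is copied from the question.

```python
import collections


def min_idx(seq, target):
    """Same helper as in the question: first index whose element is in target."""
    for idx, elem in enumerate(seq):
        if elem in target:
            return idx


def trim_idx(current, target_hist):
    """Hypothetical helper: number of leading elements that can be dropped
    while the rest still contains everything counted in target_hist."""
    hist = collections.Counter(current)
    idx = 0
    # Drop current[idx] only while it is surplus, i.e. the window holds
    # more copies of it than the target requires.
    while idx < len(current) and hist[current[idx]] > target_hist[current[idx]]:
        hist[current[idx]] -= 1
        idx += 1
    return idx


target_hist = collections.Counter('tist')
current = list('this is a test')

naive = min_idx(current[1:], target_hist) + 1
correct = trim_idx(current, target_hist)
print(naive, ''.join(current[naive:]))      # 2 is is a test
print(correct, ''.join(current[correct:]))  # 5 is a test
```

The membership test stops at the first character that appears in the target at all, even though the window still holds a surplus copy of it; the count-aware version keeps going until dropping one more character would break the window.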

def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        if t in target_hist:
            current_hist[t] += 1
    minlen = len(current)
    shortest = current[:]  # copy, so later appends to current do not change it
    for t in i:
        current.append(t)
        current_hist[t] += 1
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            #idx = min_idx(current[1:], target_hist) + 1
            idx = 0
            while issubset(target_hist, current_hist):
                u = current[idx]
                current_hist[u] -= 1
                idx += 1
            idx -= 1
            u = current[idx]
            current_hist[u] += 1
            current = current[idx:]
        if len(current) < minlen:
            minlen = len(current)
            shortest = current[:]
    return shortest

In [9]: minsub('this is a test string', 'tist')
Out[9]: ['t', ' ', 's', 't', 'r', 'i']
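One edge case this run does not exercise: if a contains some of the characters of b but not enough of them (for example only one 't' when b needs two), the loop that builds the base substring calls next(i) on an exhausted iterator and raises StopIteration. A minimal sketch of a pre-check follows; `covers` is a hypothetical helper, not part of the code above.

```python
import collections


def covers(a, b):
    """True if a contains every character of b, counting duplicates."""
    # Counter subtraction drops non-positive counts, so an empty result
    # means nothing required by b is missing from a.
    return not (collections.Counter(b) - collections.Counter(a))


print(covers('this is a test string', 'tist'))  # True
print(covers('this is a test string', 'xyz'))   # False
print(covers('abc', 'aa'))                      # False
```

Calling it at the top of minsub and returning [] when it is False would avoid the exception, at the cost of one extra pass over a.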

Finding the shortest substring containing certain characters in linear time