Question

What is the complexity of the algorithm for finding the smallest snippet that contains all the search keywords?

Answer 1

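As stated, the problem is solved by a rather simple algorithm:

Just look through the input text sequentially from the very beginning and check each word: is it in the search key or not? If the word is in the key, add it to the end of a structure that we will call The Current Block. The Current Block is just a linear sequence of words, each word accompanied by the position at which it was found in the text. The Current Block must maintain the following Property: the very first word in The Current Block must be present in The Current Block once and only once. If you add a new word to the end of The Current Block and the above Property becomes violated, you have to remove the very first word from the block. This process is called normalization of The Current Block. Normalization is a potentially iterative process, since once you remove the very first word from the block, the new first word might also violate The Property, so you'll have to remove it as well. And so on.

So, basically The Current Block is a FIFO sequence: new words arrive at the right end and get removed from the left end by the normalization process.

All you have to do to solve the problem is look through the text, maintain The Current Block, and normalize it when necessary so that it satisfies The Property. The shortest block containing all the keywords that you ever build is the answer to the problem.

For example, consider the text

CxxxAxxxBxxAxxCxBAxxxC

with keywords A, B and C. Looking through the text, you'll build the following sequence of blocks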

C
CA
CAB - all words, length 9 (CxxxAxxxB...)
CABA - all words, length 12 (CxxxAxxxBxxA...)
CABAC - violates The Property, remove first C
ABAC - violates The Property, remove first A
BAC - all words, length 7 (...BxxAxxC...)
BACB - violates The Property, remove first B
ACB - all words, length 6 (...AxxCxB...)
ACBA - violates The Property, remove first A
CBA - all words, length 4 (...CxBA...)
CBAC - violates The Property, remove first C
BAC - all words, length 6 (...BAxxxC)

The best block we built has length 4, which is the answer in this case

CxxxAxxxBxxAxx CxBA xxxC
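
The walkthrough above could be coded roughly as follows. This is only a sketch under a few assumptions that are not in the answer: words are single characters as in the example, the class and method names (CurrentBlock, shortestSpan) are made up, and a per-keyword count map is used to test The Property without rescanning the block.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; the names are illustrative and not taken from the answer.
class CurrentBlock {
    /**
     * Returns {start, end} (0-based, inclusive) of the shortest span of text that
     * contains every character in keys, or null if some key never occurs.
     */
    static int[] shortestSpan(String text, Set<Character> keys) {
        Deque<Integer> block = new ArrayDeque<>();        // positions of key characters, oldest first
        Map<Character, Integer> counts = new HashMap<>(); // occurrences of each key inside the block
        int[] best = null;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!keys.contains(c)) {
                continue; // not a keyword, so it never enters the block
            }
            block.addLast(i);
            counts.merge(c, 1, Integer::sum);
            // Normalization: drop the first word while it occurs more than once in the block.
            while (counts.get(text.charAt(block.peekFirst())) > 1) {
                counts.merge(text.charAt(block.pollFirst()), -1, Integer::sum);
            }
            // The block holds all keywords once every key has an entry in counts.
            if (counts.size() == keys.size()) {
                int start = block.peekFirst();
                if (best == null || i - start < best[1] - best[0]) {
                    best = new int[] {start, i};
                }
            }
        }
        return best;
    }
}
```

For the example text with keys A, B and C this returns the span 14..17 (0-based), i.e. CxBA, of length 4.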

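The exact complexity of this algorithm depends on the input, since the input dictates how many iterations the normalization process will make, but ignoring normalization the complexity would trivially be O(N * log M), where N is the number of words in the text, M is the number of keywords, and O(log M) is the complexity of checking whether the current word belongs to the keyword set.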
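
One way to put a bound on the normalization cost, as a sketch: assume (this is not stated in the answer) that the block keeps a per-keyword occurrence count in the same O(log M) structure, so The Property can be checked without rescanning the block. Let a_t be 1 if step t appends a word to the block (0 otherwise) and r_t the number of words removed during step t. Every removed word was appended earlier, so:

```latex
\sum_{t=1}^{N} r_t \;\le\; \sum_{t=1}^{N} a_t \;\le\; N
\quad\Longrightarrow\quad
T(N, M) \;=\; \underbrace{O(N \log M)}_{\text{membership checks}}
\;+\; \underbrace{O(N \log M)}_{\text{appends and removals}}
\;=\; O(N \log M)
```

Under that assumption the number of normalization steps in any single iteration still depends on the input, but their total over the whole text does not exceed N.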

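Now, having said that, I have to admit that I suspect this might not be what you need. Since you mentioned Google in the title, the statement of the problem you gave in your post may not be complete. Maybe in your case the text is indexed? (With an index the above algorithm is still applicable, it just becomes more efficient.) Maybe there is some tricky database that describes the text and allows for a more efficient solution (like one that does not look through the entire text)? I can only guess, and you are not saying...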

Answer 2

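I think the solution proposed by AndreyT assumes there are no duplicates in the keywords/search terms. Also, the current block can get as big as the text itself if the text contains lots of duplicate keywords. For example: Text: 'ABBBBBBBBBB' Keyword text: 'AB' Current block: 'ABBBBBBBBBB'

Anyway, I have implemented it in C# and done some basic testing; it would be nice to get some feedback on whether it works or not :)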

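    using System;
    using System.Collections.Generic;
    using System.Linq;

    static string FindMinWindow(string text, string searchTerms)
    {
        // Maps each distinct search character to whether it has been seen in the text.
        // The indexer (rather than Add) tolerates duplicate characters in searchTerms.
        Dictionary<char, bool> searchIndex = new Dictionary<char, bool>();
        foreach (var item in searchTerms)
        {
            searchIndex[item] = false;
        }

        // Nothing to search for.
        if (searchIndex.Count == 0)
        {
            return String.Empty;
        }

        // The current block: matched characters with their positions, oldest first.
        Queue<Tuple<char, int>> currentBlock = new Queue<Tuple<char, int>>();
        int noOfMatches = 0; // number of distinct search characters seen so far
        int minLength = Int32.MaxValue;
        int startIndex = 0;
        for(int i = 0; i < text.Length; i++)
        {
            char item = text[i];
            if (searchIndex.ContainsKey(item))
            {
                if (!searchIndex[item])
                {
                    noOfMatches++;
                }

                searchIndex[item] = true;
                var newEntry = new Tuple<char, int> ( item, i );
                currentBlock.Enqueue(newEntry);

                // Normalization step: drop the first entry while its character
                // occurs more than once in the block (a linear scan of the queue).
                while (currentBlock.Count(o => o.Item1.Equals(currentBlock.First().Item1)) > 1)
                {
                    currentBlock.Dequeue();
                }

                // Figuring out minimum length. Compare against the number of
                // distinct search characters, not searchTerms.Length.
                if (noOfMatches == searchIndex.Count)
                {
                    var length = currentBlock.Last().Item2 - currentBlock.First().Item2 + 1;
                    if (length < minLength)
                    {
                        startIndex = currentBlock.First().Item2;
                        minLength = length;
                    }
                }
            }
        }
        return noOfMatches == searchIndex.Count ? text.Substring(startIndex, minLength) : String.Empty;
    }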

Answer 3

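This is an interesting question. To restate it more formally: given a list L (the web page) of length n and a set S (the query) of size k, find the smallest sublist of L that contains all the elements of S.

I'll start with a brute-force solution in the hope of inspiring others to beat it. Note that set membership can be tested in constant time, after one pass through the set. See this question. Also note that this presumes all the elements of S are in fact in L, otherwise it will just return the sublist from 1 to n.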

best = (1,n)
For i from 1 to n-k+1:
  Create/reset a hash found[] mapping each element of S to False, and let counter = 0.
  For j from i to n or until counter == k:
    If L[j] is in S and not found[L[j]] then counter++ and let found[L[j]] = True;
  If counter == k and j-i < best[2]-best[1] then let best = (i,j), where j is the last index examined.
Time complexity is O((n+k)(n-k)). I.e., n^2-ish.
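
A minimal Java sketch of this brute-force scan, for readers who want something runnable. The class and method names are made up for illustration, indices are 0-based, and unlike the pseudocode it returns null instead of the whole list when some element of S is missing from L.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class and method names, for illustration only.
class BruteForce {
    /**
     * Returns {i, j} (0-based, inclusive) of the shortest sublist of list that
     * contains every element of query, or null if there is no such sublist.
     */
    static int[] smallestSublist(List<String> list, Set<String> query) {
        int n = list.size();
        int k = query.size();
        if (k == 0) {
            return null; // an empty query has no meaningful window
        }
        int[] best = null;
        // A window starting at i needs at least k elements, hence i + k <= n.
        for (int i = 0; i + k <= n; i++) {
            Set<String> found = new HashSet<>(); // query elements seen since position i
            for (int j = i; j < n; j++) {
                String w = list.get(j);
                if (query.contains(w)) {
                    found.add(w);
                }
                if (found.size() == k) {
                    // First j that completes the query is the shortest window starting at i.
                    if (best == null || j - i < best[1] - best[0]) {
                        best = new int[] {i, j};
                    }
                    break;
                }
            }
        }
        return best;
    }
}
```

The outer loop tries every start position and the inner loop stops at the first position that completes the query, which matches the roughly quadratic bound given above.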

Answer 4

Here is a solution using Java 8.

import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

static Map.Entry<Integer, Integer> documentSearch(Collection<String> document, Collection<String> query) {
    Deque<KeywordIndexPair> queue = new ArrayDeque<>(query.size());
    Set<String> words = new HashSet<>(query);
    // how many times each keyword currently occurs in the queue
    Map<String, Integer> counts = new HashMap<>();

    AtomicInteger idx = new AtomicInteger();
    IndexPair interval = new IndexPair(0, Integer.MAX_VALUE);
    document.stream()
        .map(w -> new KeywordIndexPair(w, idx.getAndIncrement()))
        .filter(pair -> words.contains(pair.word)) // Queue.contains is O(n) so we trade space for efficiency
        .forEach(pair -> {
            // only the first and last elements are useful to the algorithm, so elements
            // are only ever added at the tail and removed from the head
            queue.addLast(pair);
            counts.merge(pair.word, 1, Integer::sum);
            // drop the head while its word occurs again later in the queue
            while (counts.get(queue.peekFirst().word) > 1) {
                counts.merge(queue.pollFirst().word, -1, Integer::sum);
            }
            KeywordIndexPair first = queue.peekFirst();
            int diff = pair.index - first.index;
            // a candidate exists only once every query word is in the queue
            if (counts.size() == words.size() && diff < interval.interval()) {
                interval.begin = first.index;
                interval.end = pair.index;
            }
        });

    return new AbstractMap.SimpleImmutableEntry<>(interval.begin, interval.end);
}

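There are 2 static nested classes, KeywordIndexPair and IndexPair, whose implementation should be apparent from their names. In a smarter programming language that supports tuples, these classes would not be necessary.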
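
Since the answer does not show the two helper classes, here is one guess at their shape. Everything in it is an assumption inferred from how the fields are used (word, index, begin, end, interval()); they are written as top-level classes here only so the snippet stands alone.

```java
// Assumed shape of the helper classes; the original answer does not include them.
class KeywordIndexPair {
    final String word; // the keyword that was matched
    final int index;   // its 0-based position in the document

    KeywordIndexPair(String word, int index) {
        this.word = word;
        this.index = index;
    }

    // Assumption: equality looks at the keyword only, so two occurrences
    // of the same word at different positions compare equal.
    @Override
    public boolean equals(Object o) {
        return o instanceof KeywordIndexPair && word.equals(((KeywordIndexPair) o).word);
    }

    @Override
    public int hashCode() {
        return word.hashCode();
    }
}

class IndexPair {
    int begin; // index of the first word of the interval
    int end;   // index of the last word of the interval

    IndexPair(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    // Distance between the two ends; the initial (0, Integer.MAX_VALUE) pair is
    // therefore longer than any real interval.
    int interval() {
        return end - begin;
    }
}
```

The fields of IndexPair are left mutable because documentSearch assigns to begin and end as it finds shorter intervals.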

Test:

Document: apple, banana, apple, apple, dog, cat, apple, dog, banana, apple, cat, dog

Query: banana, cat

Interval: 8, 10

Answer 5

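For every word, keep the minimum and maximum index, in case it occurs more than once. If it does not, the minimum and maximum index will be the same.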

import edu.princeton.cs.algs4.ST;

public class DicMN {

    ST<String, Words> st = new ST<>();

    public class Words {
        int min;
        int max;
        public Words(int index) {
            min = index;
            max = index;
        }
    }

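    // Returns the length of the span from the smallest min index to the largest
    // max index among the query words that were stored, or 0 if none were found.
    public int findMinInterval(String[] sw) {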

        int begin = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (int i = 0; i < sw.length; i++) {
            if (st.contains(sw[i])) {
                Words w = st.get(sw[i]);
                begin = Math.min(begin, w.min);
                end = Math.max(end, w.max);
            }
        }

        if (begin != Integer.MAX_VALUE) {
            return (end - begin) + 1;
        }
        return 0;
    }

    public void put(String[] dw) {

        for (int i = 0; i < dw.length; i++) {
            if (!st.contains(dw[i])) {
                st.put(dw[i], new Words(i));
            }
            else {
                Words w = st.get(dw[i]);
                w.min = Math.min(w.min, i);
                w.max = Math.max(w.max, i);
            }
        }
    }

    public static void main(String[] args) {

        DicMN dic = new DicMN();
        String[] arr1 = { "one", "two", "three", "four", "five", "six", "seven", "eight" };
        dic.put(arr1);
        String[] arr2 = { "two", "five" };
        System.out.print("Interval:" + dic.findMinInterval(arr2));
    }
}

Google search results: How to find the minimum window that contains all the search keywords?

5 answers: