Question

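What is the most efficient way to find a sequence within an IEnumerable<T> using LINQ?

I want to be able to create an extension method that allows the following call: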

int startIndex = largeSequence.FindSequence(subSequence)

The match must be adjacent and in order.

Answer 1

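Here's an implementation of an algorithm that finds a subsequence in a sequence. I called the method IndexOfSequence, because it makes the intent more explicit and is similar to the existing IndexOf method: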

using System.Collections.Generic;
using System.Linq;

public static class ExtensionMethods
{
    public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence)
    {
        return source.IndexOfSequence(sequence, EqualityComparer<T>.Default);
    }

    public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence, IEqualityComparer<T> comparer)
    {
        var seq = sequence.ToArray();
        if (seq.Length == 0)
            return 0; // an empty sequence matches at index 0, as string.IndexOf("") does

        int p = 0; // current position in source sequence
        int i = 0; // current position in searched sequence
        var prospects = new List<int>(); // list of prospective matches
        foreach (var item in source)
        {
            // Remove bad prospective matches
            prospects.RemoveAll(k => !comparer.Equals(item, seq[p - k]));

            // Is it the start of a prospective match ?
            if (comparer.Equals(item, seq[0]))
            {
                prospects.Add(p);
            }

            // Does the current item continue the partial match ?
            if (comparer.Equals(item, seq[i]))
            {
                i++;
                // Do we have a complete match ?
                if (i == seq.Length)
                {
                    // Bingo !
                    return p - seq.Length + 1;
                }
            }
            else // Mismatch
            {
                // Do we have prospective matches to fall back to ?
                if (prospects.Count > 0)
                {
                    // Yes, use the first one
                    int k = prospects[0];
                    i = p - k + 1;
                }
                else
                {
                    // No, start from beginning of searched sequence
                    i = 0;
                }
            }
            p++;
        }
        // No match
        return -1;
    }
}

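I haven't fully tested it, so it might still contain bugs. I only ran a few tests on well-known corner cases to make sure I wasn't falling into obvious traps. It seems to work fine so far...

I think the complexity is close to O(n), but I'm not an expert in Big O notation, so I could be wrong... At least it only enumerates the source sequence once, without ever going back, so it should be reasonably efficient.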

Answer 2

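The code you say you want to be able to use isn't LINQ, so I don't see why it needs to be implemented with LINQ.

This is essentially the same problem as substring search (indeed, an enumeration where order is significant is a generalisation of "string").

Since computer science has considered this problem frequently for a long time, you get to stand on the shoulders of giants.

Some reasonable starting points are: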

http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

http://en.wikipedia.org/wiki/Rabin-karp

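Even just the pseudocode in the Wikipedia articles is enough to port to C# quite easily. Look at the descriptions of performance in different cases and decide which cases are most likely to be encountered by your code.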
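As an illustration of how short such a port can be, here is a minimal sketch of Knuth-Morris-Pratt over IEnumerable<T>. The class name KmpSketch and the method name IndexOfSequenceKmp are hypothetical, and returning 0 for an empty pattern is an assumption modelled on string.IndexOf; treat it as a starting point rather than a tuned implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KmpSketch
{
    // Hypothetical name. Returns the index of the first occurrence of
    // pattern inside source, or -1 if there is none.
    public static int IndexOfSequenceKmp<T>(
        this IEnumerable<T> source,
        IEnumerable<T> pattern,
        IEqualityComparer<T> comparer = null)
    {
        comparer = comparer ?? EqualityComparer<T>.Default;
        T[] p = pattern.ToArray();
        if (p.Length == 0)
            return 0; // assumption: an empty pattern matches at index 0

        // failure[j] = length of the longest proper prefix of p[0..j]
        // that is also a suffix of p[0..j].
        int[] failure = new int[p.Length];
        for (int j = 1, k = 0; j < p.Length; j++)
        {
            while (k > 0 && !comparer.Equals(p[j], p[k]))
                k = failure[k - 1];
            if (comparer.Equals(p[j], p[k]))
                k++;
            failure[j] = k;
        }

        int matched = 0;  // number of pattern items currently matched
        int position = 0; // index of the current item in source
        foreach (T item in source)
        {
            // On a mismatch, fall back to the longest shorter partial match.
            while (matched > 0 && !comparer.Equals(item, p[matched]))
                matched = failure[matched - 1];
            if (comparer.Equals(item, p[matched]))
                matched++;
            if (matched == p.Length)
                return position - p.Length + 1;
            position++;
        }
        return -1;
    }
}
```

The source is enumerated once and never rewound, and only the pattern is buffered, so this also works on sequences that cannot be enumerated twice.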

Answer 3

I understand this is an old question, but I needed this exact method and I wrote it up like so:

public static int ContainsSubsequence<T>(this IEnumerable<T> elements, IEnumerable<T> subSequence) where T: IEquatable<T>
{
    return ContainsSubsequence(elements, 0, subSequence);
}

private static int ContainsSubsequence<T>(IEnumerable<T> elements, int index, IEnumerable<T> subSequence) where T: IEquatable<T>
{
    // Do we have any elements left?
    bool elementsLeft = elements.Any();

    // Do we have any of the sub-sequence left?
    bool sequenceLeft = subSequence.Any();

    // No elements but sub-sequence not fully matched
    if (!elementsLeft && sequenceLeft)
        return -1; // Nope, didn't match

    // No elements of sub-sequence, which means even if there are
    // more elements, we matched the sub-sequence fully
    if (!sequenceLeft)
        return index - subSequence.Count(); // Matched!

    // If we didn't reach a terminal condition,
    // check the first element of the sub-sequence against the first element
    if (subSequence.First().Equals(elements.First()))
        // Yes, it matched - move onto the next. Consume (skip) one element in each
        return ContainsSubsequence(elements.Skip(1), index + 1, subSequence.Skip(1));
    else
        // No, it didn't match. Try the next element, without consuming an element
        // from the sub-sequence
        return ContainsSubsequence(elements.Skip(1), index + 1, subSequence);
}

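Updated to not just return whether the sub-sequence matched, but where it started in the original sequence.

This is an extension method on IEnumerable, fully lazy, terminates early, and is more effective than the currently highest-voted answer. However, as @wai-ha-lee points out, it is recursive and creates a lot of enumerators. Use it where applicable (performance/memory). It was fine for my needs, but YMMV.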
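If the recursion depth or the number of enumerators is a concern, the same idea can be written as a loop. The following is only a sketch under my own assumptions: the names WindowSketch and IndexOfSubsequence are hypothetical, the sub-sequence is buffered into an array, and an empty sub-sequence is taken to match at index 0. It keeps a sliding window of the last few elements, so it checks adjacency explicitly and returns the start index of the match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WindowSketch
{
    // Hypothetical name, not part of the answer above.
    public static int IndexOfSubsequence<T>(
        this IEnumerable<T> elements,
        IEnumerable<T> subSequence) where T : IEquatable<T>
    {
        T[] pattern = subSequence.ToArray();
        if (pattern.Length == 0)
            return 0; // assumption: an empty sub-sequence matches at index 0

        // Holds at most pattern.Length of the most recent elements.
        var window = new Queue<T>(pattern.Length);
        int index = 0; // index of the current element in elements
        foreach (T element in elements)
        {
            window.Enqueue(element);
            if (window.Count > pattern.Length)
                window.Dequeue();

            // Compare the whole window once it is full.
            if (window.Count == pattern.Length && window.SequenceEqual(pattern))
                return index - pattern.Length + 1;
            index++;
        }
        return -1;
    }
}
```

This still enumerates the source once and stops at the first match, but each step compares up to the full window, so it trades the recursion for O(n * m) comparisons in the worst case.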

Answer 4

You can use a library called Sequences to do this (disclaimer: I'm the author).

It has an IndexOfSlice method that does exactly what you need: it is an implementation of the Knuth-Morris-Pratt algorithm.

int startIndex = largeSequence.AsSequence().IndexOfSlice(subSequence);

Answer 5

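Update: Given the clarification of the question, my answer below doesn't apply. Leaving it for historical purposes.

You may want to use mySequence.Where(). The key then is to optimize the predicate to work well in your environment. This can vary quite a bit depending on your requirements and typical usage patterns.

It's quite possible that what works well for small collections doesn't scale to much larger collections, depending on the type of T.

Of course, if 90% of the use is for small collections, then optimizing for the outlier large collection seems a bit YAGNI.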

Find sequence in IEnumerable<T> using Linq

5 answers: