Question

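I'm building a natural language processor in C#, and many of the "words" in our database are actually multi-word phrases that refer to a single noun or action. Please don't debate this design call; suffice it to say that it can't be changed at this point. I have string arrays holding the related words (chunks) of a sentence, and I need to test them against these phrases and words. What is the appropriate, idiomatic way to handle extracting the subarrays so that I run the least risk of out-of-range errors and the like?

To give an example of the desired logic, let me step through a run with a sample chunk. For our purposes, assume the only multi-word phrase in the database is "quick brown".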

Full phrase: The quick brown fox -> encoded as {"The", "quick", "brown", "fox"}
First iteration: Test "The quick brown fox" -> returns nothing
Second iteration: Test "The quick brown" -> returns nothing
Third iteration: Test "The quick" -> returns nothing
Fourth iteration: Test "The" -> returns value
Fifth iteration: Test "quick brown fox" -> returns nothing
Sixth iteration: Test "quick brown" -> returns value
Seventh iteration: Test "fox" -> returns value

Sum all returned values and return.
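
For illustration, the walk-through above can be sketched as a greedy longest-match loop. This is only a sketch under assumptions: the `PhraseScorer` class, the `Score` method, and the in-memory `lexicon` dictionary standing in for the database lookup are all hypothetical names, not part of the original code. It relies on the `string.Join(string, string[], int, int)` overload, so no subarray is ever extracted.

```csharp
using System;
using System.Collections.Generic;

public static class PhraseScorer
{
    // Greedy longest-match scan: at each start position, try the longest
    // remaining phrase first and shrink it one word at a time.
    // "lexicon" is a hypothetical stand-in for the database lookup.
    public static int Score(string[] words, IDictionary<string, int> lexicon)
    {
        int total = 0;
        int start = 0;
        while (start < words.Length)
        {
            int consumed = 1; // always advance, even if nothing matches
            for (int count = words.Length - start; count >= 1; count--)
            {
                // start + count never exceeds words.Length, so Join stays in bounds.
                string candidate = string.Join(" ", words, start, count);
                int value;
                if (lexicon.TryGetValue(candidate, out value))
                {
                    total += value;
                    consumed = count; // skip the words covered by this match
                    break;
                }
            }
            start += consumed;
        }
        return total;
    }
}
```

With a lexicon containing "The", "quick brown", and "fox", this performs the same seven tests listed above, in the same order.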

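I have a few ideas about how to solve this, but the more I look at it, the more I worry about array addressing errors and other such horrors plaguing my code. The phrase comes in as a string array, but I'm fine with putting it into an IEnumerable. My only concern is Enumerable's lack of indexing.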

Answer 1

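This sounds like a perfect application for the Aho-Corasick string matching algorithm. I have a dictionary of about 10 million phrases that I run strings of phrases through, and it is very fast. In a single pass it tells you every matching phrase. So if "the", "fox", and "quick brown" are all in the dictionary, a single pass will return the indexes of all three.

It's pretty easy to implement. Look up the original paper online and you can build it in an afternoon.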

Efficient String Matching: An Aid to Bibliographic Search
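
As a rough illustration of the idea, here is a minimal sketch of Aho-Corasick adapted to whole words instead of characters. It is not the implementation described above: the `WordAhoCorasick` class and its `Find` method are hypothetical names, phrases are assumed to be single-space separated, and matching is case-sensitive. Note that it reports every match, including overlapping ones, so any "longest phrase wins" rule would still have to be applied to the results.

```csharp
using System;
using System.Collections.Generic;

public sealed class WordAhoCorasick
{
    private sealed class Node
    {
        public readonly Dictionary<string, Node> Next = new Dictionary<string, Node>();
        public readonly List<string> Outputs = new List<string>();
        public Node Fail;
    }

    private readonly Node root = new Node();

    public WordAhoCorasick(IEnumerable<string> phrases)
    {
        // Build the trie, one edge per word.
        foreach (string phrase in phrases)
        {
            Node node = root;
            foreach (string word in phrase.Split(' '))
            {
                Node child;
                if (!node.Next.TryGetValue(word, out child))
                {
                    child = new Node();
                    node.Next[word] = child;
                }
                node = child;
            }
            node.Outputs.Add(phrase);
        }

        // Breadth-first pass to set the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node current = queue.Dequeue();
            foreach (KeyValuePair<string, Node> edge in current.Next)
            {
                Node fallback = current.Fail;
                while (fallback != root && !fallback.Next.ContainsKey(edge.Key))
                    fallback = fallback.Fail;
                Node target;
                edge.Value.Fail = fallback.Next.TryGetValue(edge.Key, out target) ? target : root;
                // Inherit matches that end at the failure target.
                edge.Value.Outputs.AddRange(edge.Value.Fail.Outputs);
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Returns (start index, phrase) for every dictionary phrase found in words.
    public List<KeyValuePair<int, string>> Find(string[] words)
    {
        var hits = new List<KeyValuePair<int, string>>();
        Node node = root;
        for (int i = 0; i < words.Length; i++)
        {
            while (node != root && !node.Next.ContainsKey(words[i]))
                node = node.Fail;
            Node next;
            node = node.Next.TryGetValue(words[i], out next) ? next : root;
            foreach (string phrase in node.Outputs)
            {
                int length = phrase.Split(' ').Length;
                hits.Add(new KeyValuePair<int, string>(i - length + 1, phrase));
            }
        }
        return hits;
    }
}
```

The automaton is built once and reused for every sentence; each word of the input is then visited once, with no subarray indexing in the calling code.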

Answer 2

Would ArraySegment or DelimitedArray help?
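
To make the suggestion concrete, here is a small sketch of what `ArraySegment<T>` offers. The `SegmentDemo` class and `Phrase` method are hypothetical names used only for illustration. The point is that the `ArraySegment<T>` constructor validates the offset and count up front, so a bad range fails immediately instead of surfacing later as an indexing error.

```csharp
using System;

public static class SegmentDemo
{
    // Joins words[offset .. offset + count) without copying the array first.
    public static string Phrase(string[] words, int offset, int count)
    {
        // The constructor throws an ArgumentException (or a subclass of it)
        // when offset and count do not describe a valid range of the array.
        var segment = new ArraySegment<string>(words, offset, count);

        // Array, Offset and Count expose the validated window.
        return string.Join(" ", segment.Array, segment.Offset, segment.Count);
    }
}
```

For example, `SegmentDemo.Phrase(words, 1, 2)` on the sample sentence yields "quick brown", while a range that runs past the end of the array throws rather than reading out of bounds.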

Answer 3

How about something like this:

    string[] words = new string[] { "The", "quick", "brown", "fox" };

    // start is the index of the first word in the segment.
    for (int start = 0; start < words.Length; start++)
    {
        // end is exclusive, so end - start is the word count (at least one word).
        for (int end = start + 1; end <= words.Length; end++)
        {
            ArraySegment<string> segment = new ArraySegment<string>(words, start, end - start);
            // test segment
        }
    }

This assumes you can run your test against an ArraySegment.

Answer 4

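The way forward lies in combining Mark's and Philipp's answers. In an ideal world I would have edited one of their posts to include this, but it seems my edit was rejected.

Anyway, I took the DelimitedArray that Mark linked to and changed a few things:

The constructor was changed to: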

    public DelimitedArray(T[] array, int offset, int count, bool throwErrors = false)
    {
        this.array = array;
        this.offset = offset;
        this.count = count;
        this.throwErrors = throwErrors;
    }

The indexer was changed to:

    public T this[int index]
    {
        get
        {
            // Validate the index relative to the window, not the backing array.
            if (index < 0 || index >= this.count)
            {
                if (throwErrors)
                    throw new IndexOutOfRangeException("Index '" + index + "' was outside the bounds of the array.");
                return default(T);
            }
            return this.array[this.offset + index];
        }
    }
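
Since the linked DelimitedArray class is not reproduced here, the following is a minimal self-contained sketch of a wrapper with the same shape. It is a hypothetical stand-in named `ArrayWindow<T>`, not the original class; the up-front range check in the constructor and the `IEnumerable<T>` implementation are assumptions added for illustration.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical stand-in for the linked DelimitedArray<T>; not the original class.
public sealed class ArrayWindow<T> : IEnumerable<T>
{
    private readonly T[] array;
    private readonly int offset;
    private readonly int count;
    private readonly bool throwErrors;

    public ArrayWindow(T[] array, int offset, int count, bool throwErrors = false)
    {
        if (array == null) throw new ArgumentNullException("array");
        // Reject windows that do not fit inside the backing array.
        if (offset < 0 || count < 0 || offset > array.Length - count)
            throw new ArgumentException("offset and count must lie within the array.");
        this.array = array;
        this.offset = offset;
        this.count = count;
        this.throwErrors = throwErrors;
    }

    public int Count { get { return count; } }

    public T this[int index]
    {
        get
        {
            // The index is relative to the window, not the backing array.
            if (index < 0 || index >= count)
            {
                if (throwErrors)
                    throw new IndexOutOfRangeException("Index '" + index + "' was outside the bounds of the window.");
                return default(T);
            }
            return array[offset + index];
        }
    }

    public IEnumerator<T> GetEnumerator()
    {
        for (int i = 0; i < count; i++)
            yield return array[offset + i];
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Because the wrapper is enumerable, it can be passed straight to `string.Join(" ", window)` to build the phrase to test.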

I then used it in Philipp's loop, which becomes:

        for (var start = 0; start < words.Length; start++)
        {
            // end is exclusive, so each segment holds at least one word.
            for (var end = start + 1; end <= words.Length; end++)
            {
                var segment = new DelimitedArray<string>(words, start, end - start);
                // Get the word/phrase to test (assumes DelimitedArray<T> implements IEnumerable<T>).
                lemma = string.Join(" ", segment);
                result = this.DoTheTest(lemma);

                if (result > 0)
                {
                    // Add the new result
                    ret = ret + result;

                    // Move the start sentinel up, mindful of the +1 that will happen at the end of the loop
                    start = start + segment.Count - 1;
                    // And instantly finish the inner loop; we're done here.
                    break;
                }
            }
        }

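If I could accept more than one answer I would mark theirs, but since neither is complete on its own, I'll accept my own answer tomorrow, when I'm able to.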

Safely/idiomatically extracting subarrays in C#
