Question

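I'm implementing a regular expression parser from scratch by generating an NFA from the regular expression and then a DFA from the NFA. The problem is that the DFA can only say whether a computation is accepting. If the regex is "n*" and the string to match is "cannot", the DFA goes to a failure state as soon as it sees the c, so I drop the first character from the front, giving "annot" and then "nnot". At this point it sees the n and enters a final state, and it would return only the single n, so I tell it to keep trying until the next character takes it out of the final state. However, when it finishes, it drops the first character again, so the input becomes "not" and it matches the "n". I don't want the subsequent matches; I only want "nn". I don't know how this would be possible.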

Answer 1

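Here is a simple but possibly not optimal algorithm. We attempt an anchored match at each successive point in the string by running the DFA starting at that point. While the DFA is running, we record the last point in the string at which the DFA was in an accepting state. When we eventually reach either the end of the string or a point where the DFA can no longer advance, we can return a match if we passed through an accepting state; in other words, if we saved an accepting position, that position is the end of the match. Otherwise, we move on to the next starting position and try again.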

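(Note: in both pseudocode algorithms below, a variable that holds a string index is assumed to be able to take the value Undefined. In a practical implementation, this value could be, for example, -1.)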

In pseudocode:

Set <start> to 0
Repeat A:
     If <start> is past the end of the string:
         Return a failure indication
     Initialise the DFA state to the starting state.
     Set <scan> to <start>
     Set <accepted> to Undefined
     Repeat B:
        If there is a character at <scan> and
        the DFA has a transition on that character:
            Advance the DFA to the indicated next state
            Increment <scan>
            If the DFA is now in an accepting state, set <accepted> to <scan>
            Continue Loop B
        Otherwise, the DFA cannot advance:
            If <accepted> is still Undefined:
                Increment <start> and continue Loop A
            Otherwise, <accepted> has a value:
                Return a match from <start> to <accepted> (semi-inclusive)
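
For what it's worth, here is a minimal Python sketch of the loop above. The DFA representation (a dict mapping each state to a dict of symbol to next state) and the names `search_first` and `N_STAR` are assumptions made for illustration, not something from the question. Like the pseudocode, it only records an accepting position after consuming at least one character, so it never reports an empty match.

```python
def search_first(delta, start_state, accepting, text):
    """Return (begin, end) of the leftmost-longest non-empty match, or None.

    delta is assumed to map state -> {symbol: next_state}; `end` is exclusive.
    """
    for start in range(len(text)):          # Loop A: try each start position
        state = start_state
        scan = start
        accepted = -1                       # -1 plays the role of Undefined
        # Loop B: advance while the DFA has a transition on the next symbol
        while scan < len(text) and text[scan] in delta.get(state, {}):
            state = delta[state][text[scan]]
            scan += 1
            if state in accepting:
                accepted = scan             # remember the last accepting point
        if accepted != -1:
            return start, accepted
    return None


# Hypothetical DFA for n*: one state that is both start and accepting.
N_STAR = {0: {"n": 0}}

print(search_first(N_STAR, 0, {0}, "cannot"))   # (2, 4), i.e. "nn"
```

Because the function returns as soon as one start position yields a match, the later "n" in "not" is never reported, which is the behaviour asked for in the question.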

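The problem with the above algorithm is that Loop B can execute an arbitrary number of times before failing and backtracking to the next starting position. So in the worst case, the search time is quadratic in the length of the string. That will happen, for example, with the pattern a*b and a string consisting of a large number of a's.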
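
To make the quadratic behaviour concrete, the sketch below counts DFA transitions taken by the simple scan for the pattern a*b on strings of a's. The function name `count_steps`, the dict-based DFA encoding and the constant `A_STAR_B` are assumptions for illustration only.

```python
def count_steps(delta, start_state, accepting, text):
    """Count DFA transitions taken by the simple restart-at-each-position scan."""
    steps = 0
    for start in range(len(text)):
        state, scan, accepted = start_state, start, -1
        while scan < len(text) and text[scan] in delta.get(state, {}):
            state = delta[state][text[scan]]
            scan += 1
            steps += 1
            if state in accepting:
                accepted = scan
        if accepted != -1:
            break                      # a match was found; stop scanning
    return steps


# Hypothetical DFA for a*b: state 0 loops on 'a', 'b' leads to accepting state 1.
A_STAR_B = {0: {"a": 0, "b": 1}}

for n in (100, 200, 400):
    print(n, count_steps(A_STAR_B, 0, {1}, "a" * n))
# 100 5050
# 200 20100
# 400 80200
```

Each start position runs to the end of the string before failing, so the total is n(n+1)/2 transitions: doubling the input roughly quadruples the work.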

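An alternative is to run several DFAs in parallel. Each DFA corresponds to a different starting point in the string. We scan the string linearly; at each position, we can spawn a new DFA corresponding to that position, whose state is the initial state.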

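It's important to note that not every starting point has a DFA, because there is no need to keep two DFAs that are in the same state. Since the search is for the earliest match in the string, if two DFAs share the same state, only the one with the earlier starting point is a plausible match. Furthermore, once some DFA has reached an accepting state, it is no longer necessary to retain any DFA whose starting point comes later, which means that as soon as any DFA reaches an accepting state, we stop adding new DFAs during the scan.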

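Since the number of active DFAs is at most the number of states in the DFA, this algorithm runs in O(NM), where N is the length of the string and M is the number of states in the DFA. In practice, the number of active DFAs is usually much smaller than the number of states (unless there are very few states).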

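Nonetheless, pathological worst cases still exist, because the NFA⇒DFA transformation can increase the number of states exponentially. The exponential blow-up can be avoided by using a collection of NFAs instead of DFAs. It is convenient to simplify the NFA transitions by using an ε-free NFA, obtained either by taking ε-closures on the Thompson automaton or by building a Glushkov automaton. Using the Glushkov automaton guarantees that the number of states is at most one more than the number of symbol occurrences in the pattern, so it grows only linearly with the length of the pattern.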
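
As a rough illustration of the ε-closure route, the sketch below removes ε-moves from a small NFA using the standard rule: a state's new transitions are the union of the transitions of everything in its ε-closure, and a state is accepting if its closure contains an accepting state. The function names, the dict-based encoding and the tiny hand-written NFA for a*b are all assumptions for illustration; it is not a full Thompson construction.

```python
def epsilon_closure(eps, state):
    """All states reachable from `state` using only epsilon moves."""
    seen, stack = {state}, [state]
    while stack:
        for nxt in eps.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def remove_epsilons(states, eps, delta, accepting):
    """Return (new_delta, new_accepting) for an equivalent epsilon-free NFA.

    eps:   state -> iterable of states reachable by one epsilon move
    delta: state -> {symbol: set of next states}
    """
    new_delta, new_accepting = {}, set()
    for q in states:
        closure = epsilon_closure(eps, q)
        if closure & accepting:
            new_accepting.add(q)
        moves = {}
        for p in closure:
            for symbol, targets in delta.get(p, {}).items():
                moves.setdefault(symbol, set()).update(targets)
        if moves:
            new_delta[q] = moves
    return new_delta, new_accepting


# Hand-written NFA with epsilon moves for a*b (hypothetical state numbering).
EPS = {0: [1], 2: [1]}
DELTA = {1: {"a": {2}, "b": {3}}}
new_delta, new_accepting = remove_epsilons(range(4), EPS, DELTA, {3})
print(new_delta[0])      # {'a': {2}, 'b': {3}}
print(new_accepting)     # {3}
```

The state count is unchanged by this transformation, which is what keeps the simulation cost tied to the size of the NFA rather than to a potentially exponential DFA.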

In pseudocode:

Initialise a vector <v> of <index, state> pairs. Initially the vector
is empty, and its maximum size is the number of states. This vector is
always kept in increasing order by index.

Initialise another vector <active> of string indices, one for each state.
Initially all the elements in <active> are Undefined. This vector records
the most recent index at which some Automaton was in each state.

Initialise <match> to a pair of index positions, both undefined. This
will record the best match found so far.

For each position <scan> in the string:
    If <match> has not yet been set:
        Append {<scan>, <start_state>} to <v>.
    If <v> is empty:
        Return <match> if it has been set, or otherwise
        return a failure indication.
    Copy <v> to <v'> and empty <v>. (It's possible to recycle <v>,
    but it's easier to write the pseudocode with a copy.) 
    For each pair {<i>, <q>} in <v'>:
        If <match> has been set and <i> is greater than the starting
        index in <match>:
            Terminate this loop and continue with the outer loop.
        If there is no transition from state <q> on the symbol at <scan>:
            Continue with the next pair.
        Otherwise, there is a transition to <q'> (if using NFAs, do this for each transition):
            If the index in <active> corresponding to <q'> has already
            been set to <scan>:
                Continue with the next pair.
            Otherwise, <q'> is not yet in <v>:
                Append the pair {<i>, <q'>} at the end of <v>.
                Set the index in <active> at <q'> to <scan>.
                If <q'> is an accepting state:
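                    Set <match> to {<i>, <scan>} (both ends inclusive).
After the last position has been processed:
    Return <match> if it has been set, or otherwise
    return a failure indication.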
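
A minimal Python sketch of this parallel scan follows, for anyone who prefers runnable code. The names (`search_parallel`, `threads`, `active`) and the encoding of transitions as state -> {symbol: iterable of next states} are assumptions chosen so that the same function can be fed either a DFA or an ε-free NFA. It reports the end of the match as an exclusive index and, like the pseudocode, never returns an empty match.

```python
def search_parallel(delta, start_state, accepting, text):
    """Leftmost-longest non-empty match as (begin, end), end exclusive, or None."""
    threads = []    # <v>: (start index, state) pairs, ordered by start index
    active = {}     # <active>: state -> scan position at which it was last added
    match = None    # best match found so far
    for scan, symbol in enumerate(text):
        if match is None:
            threads.append((scan, start_state))   # spawn an automaton here
        if not threads:
            break
        previous, threads = threads, []
        for begin, state in previous:
            if match is not None and begin > match[0]:
                break                             # later starts cannot win
            for target in delta.get(state, {}).get(symbol, ()):
                if active.get(target) == scan:
                    continue                      # an earlier start owns this state
                active[target] = scan
                threads.append((begin, target))
                if target in accepting:
                    match = (begin, scan + 1)
    return match


# Hypothetical automata: n* (single accepting state) and a*b.
N_STAR = {0: {"n": [0]}}
A_STAR_B = {0: {"a": [0], "b": [1]}}

print(search_parallel(N_STAR, 0, {0}, "cannot"))       # (2, 4)
print(search_parallel(A_STAR_B, 0, {1}, "a" * 1000))   # None
```

On the all-a's input, at most one thread per state survives each step, so the scan stays linear in the length of the string instead of restarting from every position.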
