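Interview: Machine coding / regex (better alternative to my solution)

Time: 2014-04-21 17:21:31

Tags: regex algorithm finite-automata

Following is the interview question:

Machine coding round (time: 1 hour)

Given an expression and a string testCase, evaluate whether testCase is valid for the expression.

The expression may contain:

  • letters [a-z]
  • '.' ('.' represents any character in [a-z])
  • '*' ('*' has the same properties as in a normal RegExp)
  • '^' ('^' represents the start of the string)
  • '$' ('$' represents the end of the string)

Sample cases: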

Expression   Test Case   Valid
ab           ab          true 
a*b          aaaaaab     true 
a*b*c*       abc         true 
a*b*c        aaabccc     false 
^abc*b       abccccb     true 
^abc*b       abbccccb    false 
^abcd$       abcd        true 
^abc*abc$    abcabc      true 
^abc.abc$    abczabc     true 
^ab..*abc$   abyxxxxabc  true
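
The expected results in the table are consistent with whole-string matching (note the `a*b*c` / `aaabccc` row). As a rough cross-check, and not as part of the interview task itself, the rows can be compared against the JDK regex engine. The class and method names below (`SampleCases`, `referenceMatch`, `tableAgreesWithOracle`) are made up for this sketch; `Pattern.matches` is the standard `java.util.regex` call that requires the whole input to match.

```java
import java.util.regex.Pattern;

final class SampleCases {
    // Columns: expression, test case, expected result (copied from the table above).
    static final String[][] CASES = {
        {"ab", "ab", "true"},
        {"a*b", "aaaaaab", "true"},
        {"a*b*c*", "abc", "true"},
        {"a*b*c", "aaabccc", "false"},
        {"^abc*b", "abccccb", "true"},
        {"^abc*b", "abbccccb", "false"},
        {"^abcd$", "abcd", "true"},
        {"^abc*abc$", "abcabc", "true"},
        {"^abc.abc$", "abczabc", "true"},
        {"^ab..*abc$", "abyxxxxabc", "true"},
    };

    // Whole-string match, using the JDK regex engine as a reference oracle.
    static boolean referenceMatch(String expression, String testCase) {
        return Pattern.matches(expression, testCase);
    }

    // Returns true when every row of the table agrees with the oracle.
    static boolean tableAgreesWithOracle() {
        for (String[] row : CASES) {
            boolean expected = Boolean.parseBoolean(row[2]);
            if (referenceMatch(row[0], row[1]) != expected) {
                return false;
            }
        }
        return true;
    }
}
```

A hand-written matcher can be checked the same way by swapping it in for `referenceMatch`.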

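My approach:

  1. Convert the given regular expression into concatenation (ab), alternation (a|b), and Kleene star (a*), adding + as an explicit concatenation operator. For example:

    abc$  =>  .*+a+b+c
    ^ab..*abc$  => a+b+.+.*+a+b+c

  2. Convert it to postfix notation based on precedence (parentheses > Kleene star > concatenation > ...):

    (a|b)*+c  =>  ab|*c+

  3. Build the NFA using Thompson's construction.

  4. Backtrack / traverse the NFA by maintaining a set of states.

When I started implementing it, it took me well over 1 hour. I found step 3 very time-consuming. I built the NFA from the postfix notation with a stack, adding new states and transitions as needed.

So, I would like to know whether there is a faster alternative solution to this problem, or a faster way to implement step 3. I found this CareerCup link, where someone mentioned in a comment that the problem comes from a programming contest. So if anyone has solved this before, or has a better solution, I would be glad to know where I went wrong.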
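
For readers who want to see steps 1 and 2 concretely, here is a minimal Java sketch. It is not the asker's code: `PostfixSketch`, `addConcat` and `toPostfix` are hypothetical names, the input is assumed to be well formed, and a missing `$` is simply left alone because the question shows no example for that case.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PostfixSketch {
    private static boolean isOperand(char c) {
        return (c >= 'a' && c <= 'z') || c == '.';
    }

    // Step 1: drop a leading '^' (or prepend ".*" when it is absent), drop a
    // trailing '$', and insert '+' wherever two operands are adjacent.
    static String addConcat(String expr) {
        String body = expr.startsWith("^") ? expr.substring(1) : ".*" + expr;
        if (body.endsWith("$")) {
            body = body.substring(0, body.length() - 1);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (i > 0) {
                char prev = body.charAt(i - 1);
                boolean prevEndsOperand = isOperand(prev) || prev == '*' || prev == ')';
                boolean curStartsOperand = isOperand(c) || c == '(';
                if (prevEndsOperand && curStartsOperand) {
                    out.append('+');
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    private static int precedence(char op) {
        return op == '+' ? 2 : 1; // concatenation binds tighter than '|'
    }

    // Step 2: shunting-yard conversion. '*' is a postfix operator already, so
    // it goes straight to the output; '+' and '|' go through the stack.
    static String toPostfix(String infix) {
        StringBuilder out = new StringBuilder();
        Deque<Character> ops = new ArrayDeque<>();
        for (char c : infix.toCharArray()) {
            if (isOperand(c) || c == '*') {
                out.append(c);
            } else if (c == '(') {
                ops.push(c);
            } else if (c == ')') {
                while (ops.peek() != '(') {
                    out.append(ops.pop());
                }
                ops.pop(); // discard the '('
            } else { // '+' or '|'
                while (!ops.isEmpty() && ops.peek() != '('
                        && precedence(ops.peek()) >= precedence(c)) {
                    out.append(ops.pop());
                }
                ops.push(c);
            }
        }
        while (!ops.isEmpty()) {
            out.append(ops.pop());
        }
        return out.toString();
    }
}
```

With these definitions, `addConcat("^ab..*abc$")` gives `a+b+.+.*+a+b+c` and `toPostfix("(a|b)*+c")` gives `ab|*c+`, matching the examples in the question.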

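4 answers:

Answer 0: (score: 4)

Some derivation of Levenshtein distance comes to mind. It is possibly not the fastest algorithm, but it should be quick to implement.

We can ignore ^ at the start and $ at the end; anywhere else they are invalid.

Then we construct a 2D grid where each row represents a unit [1] in the expression and each column represents a character in the test string.

[1]: A "unit" here refers to a single character, except that a * is attached to the character before it.

So for a*b*c and aaabccc, we get something like: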

   a a a b c c c
a*
b*
c

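Each cell holds a boolean value indicating validity.

Now, for each cell, set it to valid if any of these hold:

  • The value in the left neighbour is valid, the row is x* or .*, and the column is x (x being any character a-z).

    This corresponds to a * matching one additional character.

  • The value in the upper-left neighbour is valid, the row is x or ., and the column is x (x being any character a-z).

    This corresponds to a single-character match.

  • The value in the top neighbour is valid and the row is x* or .*.

    This corresponds to the * matching nothing.

Then check whether the bottom-right-most cell is valid.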

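So, for the above example, we get (V indicating valid):

   a a a b c c c
a* V V V - - - -
b* - - - V - - -
c  - - - - V - -

Since the bottom-right cell isn't valid, we return invalid.

Running time: O(stringLength*expressionLength)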
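
The three neighbour rules can be sketched as a bottom-up table fill in Java. This is one possible reading of the answer, not its author's code: `GridMatcher` and `matches` are hypothetical names, the expression is assumed to be well formed, and the sketch adds an extra row and column for "zero units" and "zero characters", which the description leaves implicit.

```java
final class GridMatcher {
    // Whole-string match for expressions made of [a-z], '.', '*', with an
    // optional leading '^' and trailing '$' (both are simply ignored).
    static boolean matches(String expression, String test) {
        String expr = expression;
        if (expr.startsWith("^")) {
            expr = expr.substring(1);
        }
        if (expr.endsWith("$")) {
            expr = expr.substring(0, expr.length() - 1);
        }

        // Split into units: a character plus a flag saying whether '*' follows.
        int n = expr.length();
        char[] unitChar = new char[n];
        boolean[] unitStar = new boolean[n];
        int rows = 0;
        for (int i = 0; i < n; i++) {
            unitChar[rows] = expr.charAt(i);
            if (i + 1 < n && expr.charAt(i + 1) == '*') {
                unitStar[rows] = true;
                i++; // skip the '*'
            }
            rows++;
        }

        int cols = test.length();
        // valid[r][c]: the first r units match the first c characters.
        boolean[][] valid = new boolean[rows + 1][cols + 1];
        valid[0][0] = true;
        for (int r = 1; r <= rows; r++) {
            char u = unitChar[r - 1];
            boolean star = unitStar[r - 1];
            for (int c = 0; c <= cols; c++) {
                boolean charOk = c > 0 && (u == '.' || u == test.charAt(c - 1));
                if (star) {
                    valid[r][c] = valid[r - 1][c]            // top: '*' matches nothing
                            || (charOk && valid[r][c - 1]);  // left: '*' matches one more
                } else {
                    valid[r][c] = charOk && valid[r - 1][c - 1]; // upper-left: single match
                }
            }
        }
        return valid[rows][cols]; // bottom-right-most cell
    }
}
```

Every cell is computed once from at most two neighbours, which is where the O(stringLength*expressionLength) running time comes from.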


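You should notice that we mostly explore only a fairly small part of the grid.

This solution can be improved by turning it into a recursive solution that uses memoization (and just calling the recursive solution for the bottom-right cell).

This gives us a best-case performance of O(1), but still a worst-case performance of O(stringLength*expressionLength).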
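
A top-down version with memoization might look like the following sketch. As before, this is an illustration under assumptions rather than the answer's own code: `MemoMatcher`, `matches` and `valid` are hypothetical names, and the expression is assumed to be well formed.

```java
import java.util.ArrayList;
import java.util.List;

final class MemoMatcher {
    private final List<String> units = new ArrayList<>(); // e.g. "a*", "b", "."
    private final String test;
    private final Boolean[][] memo; // null means "not computed yet"

    private MemoMatcher(String expression, String test) {
        String expr = expression;
        if (expr.startsWith("^")) {
            expr = expr.substring(1);
        }
        if (expr.endsWith("$")) {
            expr = expr.substring(0, expr.length() - 1);
        }
        for (int i = 0; i < expr.length(); i++) {
            boolean star = i + 1 < expr.length() && expr.charAt(i + 1) == '*';
            units.add(expr.substring(i, star ? i + 2 : i + 1));
            if (star) {
                i++; // skip the '*'
            }
        }
        this.test = test;
        this.memo = new Boolean[units.size() + 1][test.length() + 1];
    }

    static boolean matches(String expression, String test) {
        MemoMatcher m = new MemoMatcher(expression, test);
        // Only the bottom-right cell is requested; the recursion pulls in
        // whichever other cells it actually needs.
        return m.valid(m.units.size(), test.length());
    }

    // Do the first r units match the first c characters?
    private boolean valid(int r, int c) {
        if (r == 0) {
            return c == 0;
        }
        if (memo[r][c] != null) {
            return memo[r][c];
        }
        String unit = units.get(r - 1);
        char u = unit.charAt(0);
        boolean charOk = c > 0 && (u == '.' || u == test.charAt(c - 1));
        boolean result;
        if (unit.length() == 2) { // starred unit: top neighbour or left neighbour
            result = valid(r - 1, c) || (charOk && valid(r, c - 1));
        } else {                  // plain unit: upper-left neighbour
            result = charOk && valid(r - 1, c - 1);
        }
        memo[r][c] = result;
        return result;
    }
}
```

The O(1) best case shows up when the last unit is a plain character that differs from the last character of the string: `valid` returns false without recursing at all.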


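My solution assumes the expression must match the entire string, as inferred from the result of the above example being invalid (as per the question).

If it can instead match a substring, we can modify this slightly so that a cell in the top row is valid if:

  • The row is x* or .*.

  • The row is x or . and the column is x.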
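
A substring variant could be sketched as below. Several things here are assumptions that go beyond what the answer states: the start is left free only when the expression has no leading `^`, the end is left free only when there is no trailing `$` (in which case any valid cell in the bottom row counts), and the names `SubstringGridMatcher` and `matches` are hypothetical. Only two rows of the grid are kept at a time.

```java
final class SubstringGridMatcher {
    static boolean matches(String expression, String test) {
        boolean anchoredStart = expression.startsWith("^");
        String expr = anchoredStart ? expression.substring(1) : expression;
        boolean anchoredEnd = expr.endsWith("$");
        if (anchoredEnd) {
            expr = expr.substring(0, expr.length() - 1);
        }

        int cols = test.length();
        // prev[c]: the units processed so far match some stretch of the input
        // that ends just before position c. Without '^', a match may start
        // at any position, so the whole "zero units" row starts out valid.
        boolean[] prev = new boolean[cols + 1];
        for (int c = 0; c <= cols; c++) {
            prev[c] = !anchoredStart || c == 0;
        }

        for (int i = 0; i < expr.length(); i++) {
            char u = expr.charAt(i);
            boolean star = i + 1 < expr.length() && expr.charAt(i + 1) == '*';
            if (star) {
                i++; // skip the '*'
            }
            boolean[] cur = new boolean[cols + 1];
            for (int c = 0; c <= cols; c++) {
                boolean charOk = c > 0 && (u == '.' || u == test.charAt(c - 1));
                cur[c] = star
                        ? (prev[c] || (charOk && cur[c - 1])) // top or left neighbour
                        : (charOk && prev[c - 1]);            // upper-left neighbour
            }
            prev = cur;
        }

        if (anchoredEnd) {
            return prev[cols]; // the match must end at the end of the string
        }
        for (boolean v : prev) { // otherwise any cell in the last row will do
            if (v) {
                return true;
            }
        }
        return false;
    }
}
```

Under this reading `a*b*c` does match `aaabccc`, because the match may stop after the first `c`.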

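Answer 1: (score: 1)

With only 1 hour, we can use a simple approach.

Split the pattern into tokens: a*b.c => { a* b . c }

If the pattern doesn't start with ^, add .* at the beginning; otherwise remove the ^.

If the pattern doesn't end with $, add .* at the end; otherwise remove the $.

Then we use recursion: 3-way if the current token is a repeating pattern (increase the pattern index by 1, increase the word index by 1, or increase both indices by 1), and 1-way if it is not a repeating pattern (increase both indices by 1).

Sample code in C#: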

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace ReTest
{
    class Program
    {
        static void Main(string[] args)
        {
            Debug.Assert(IsMatch("ab", "ab") == true);
            Debug.Assert(IsMatch("aaaaaab", "a*b") == true);
            Debug.Assert(IsMatch("abc", "a*b*c*") == true);
            Debug.Assert(IsMatch("aaabccc", "a*b*c") == true); /* original false, but it should be true */
            Debug.Assert(IsMatch("abccccb", "^abc*b") == true);
            Debug.Assert(IsMatch("abbccccb", "^abc*b") == false);
            Debug.Assert(IsMatch("abcd", "^abcd$") == true);
            Debug.Assert(IsMatch("abcabc", "^abc*abc$") == true);
            Debug.Assert(IsMatch("abczabc", "^abc.abc$") == true);
            Debug.Assert(IsMatch("abyxxxxabc", "^ab..*abc$") == true);
        }

        static bool IsMatch(string input, string pattern)
        {
            List<PatternToken> patternTokens = new List<PatternToken>();
            for (int i = 0; i < pattern.Length; i++)
            {
                char token = pattern[i];
                if (token == '^')
                {
                    if (i == 0)
                        patternTokens.Add(new PatternToken { Token = token, Occurence = Occurence.Single });
                    else
                        throw new ArgumentException("input");
                }
                else if (char.IsLower(token) || token == '.')
                {
                    if (i < pattern.Length - 1 && pattern[i + 1] == '*')
                    {
                        patternTokens.Add(new PatternToken { Token = token, Occurence = Occurence.Multiple });
                        i++;
                    }
                    else
                        patternTokens.Add(new PatternToken { Token = token, Occurence = Occurence.Single });
                }
                else if (token == '$')
                {
                    if (i == pattern.Length - 1)
                        patternTokens.Add(new PatternToken { Token = token, Occurence = Occurence.Single });
                    else
                        throw new ArgumentException("input");
                }
                else
                    throw new ArgumentException("input");
            }

            PatternToken firstPatternToken = patternTokens.First();
            if (firstPatternToken.Token == '^')
                patternTokens.RemoveAt(0);
            else
                patternTokens.Insert(0, new PatternToken { Token = '.', Occurence = Occurence.Multiple });

            PatternToken lastPatternToken = patternTokens.Last();
            if (lastPatternToken.Token == '$')
                patternTokens.RemoveAt(patternTokens.Count - 1);
            else
                patternTokens.Add(new PatternToken { Token = '.', Occurence = Occurence.Multiple });

            return IsMatch(input, 0, patternTokens, 0);
        }

        static bool IsMatch(string input, int inputIndex, IList<PatternToken> pattern, int patternIndex)
        {
            if (inputIndex == input.Length)
            {
                // Input consumed: every remaining token must be able to match the empty string.
                return pattern.Skip(patternIndex).All(t => t.Occurence == Occurence.Multiple);
            }
            else if (inputIndex < input.Length && patternIndex < pattern.Count)
            {
                char c = input[inputIndex];
                PatternToken patternToken = pattern[patternIndex];
                if (patternToken.Token == '.' || patternToken.Token == c)
                {
                    if (patternToken.Occurence == Occurence.Single)
                        return IsMatch(input, inputIndex + 1, pattern, patternIndex + 1);
                    else
                        return IsMatch(input, inputIndex, pattern, patternIndex + 1) ||
                               IsMatch(input, inputIndex + 1, pattern, patternIndex) ||
                               IsMatch(input, inputIndex + 1, pattern, patternIndex + 1);
                }
                else
                    return false;
            }
            else
                return false;
        }

        class PatternToken
        {
            public char Token { get; set; }
            public Occurence Occurence { get; set; }

            public override string ToString()
            {
                if (Occurence == Occurence.Single)
                    return Token.ToString();
                else
                    return Token.ToString() + "*";
            }
        }

        enum Occurence
        {
            Single,
            Multiple
        }
    }
}

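Answer 2: (score: 0)

Here is a solution in Java. Time and space are O(n), because it makes a single greedy pass over the input without backtracking. Inline comments are provided for clarity: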

/**
 * @author Santhosh Kumar
 *
 */
public class ExpressionProblemSolution {

public static void main(String[] args) {
    System.out.println("---------- ExpressionProblemSolution - start ---------- \n");
    ExpressionProblemSolution evs = new ExpressionProblemSolution();
    evs.runMatchTests();
    System.out.println("\n---------- ExpressionProblemSolution - end ---------- ");
}

// simple node structure to keep expression terms
class Node {
    Character ch; // char [a-z]
    Character sch; // special char (^, *, $, .)
    Node next;

    Node(Character ch1, Character sch1) {
        ch = ch1;
        sch = sch1;
    }

    Node add(Character ch1, Character sch1) {
        this.next = new Node(ch1, sch1);
        return this.next;
    }

    Node next() {
        return this.next;
    }

    public String toString() {
        return "[ch=" + ch + ", sch=" + sch + "]";
    }
}

private boolean letters(char ch) {
    return (ch >= 'a' && ch <= 'z');
}

private boolean specialChars(char ch) {
    return (ch == '.' || ch == '^' || ch == '*' || ch == '$');
}

private void validate(String expression) {
    // if expression has invalid chars throw runtime exception
    if (expression == null) {
        throw new RuntimeException(
                "Expression can't be null, but it can be empty");
    }
    char[] expr = expression.toCharArray();
    for (int i = 0; i < expr.length; i++) {
        if (!letters(expr[i]) && !specialChars(expr[i])) {
            throw new RuntimeException(
                    "Expression contains invalid char at position=" + i
                            + ", invalid_char=" + expr[i]
                            + " (allowed chars are a-z, '.', '*', '^' and '$')");
        }
    }
}

// Parse the expression and split them into terms and add to list
// the list is FSM (Finite State Machine). The list is used during
// the process step to iterate through the machine states based 
// on the input string
// 
// expression = a*b*c has 3 terms -> [a*] [b*] [c] 
// expression = ^ab.*c$ has 4 terms -> [^a] [b] [.*] [c$]   
//
// Timing : O(n)    n -> expression length
// Space :  O(n)    n -> expression length decides the no.of terms stored in the list
private Node preprocess(String expression) {
    debug("preprocess - start [" + expression + "]");
    validate(expression);
    Node root = new Node(' ', ' '); // root node with empty values
    Node current = root;
    char[] expr = expression.toCharArray();
    int i = 0, n = expr.length;

    while (i < n) {
        debug("i=" + i);
        if (expr[i] == '^') { // it is prefix operator, so it always linked
                                // to the char after that
            if (i + 1 < n) {
                if (i == 0) { // ^ indicates start of the expression, so it
                                // must be first in the expr string
                    current = current.add(expr[i + 1], expr[i]);
                    i += 2;
                    continue;
                } else {
                    throw new RuntimeException(
                            "Special char ^ should be present only at the first position of the expression (position="
                                    + i + ", char=" + expr[i] + ")");
                }
            } else {
                throw new RuntimeException(
                        "Expression missing after ^ (position=" + i
                                + ", char=" + expr[i] + ")");
            }
        } else if (letters(expr[i]) || expr[i] == '.') { // [a-z] or .
            if (i + 1 < n) {
                char nextCh = expr[i + 1];
                if (nextCh == '$' && i + 1 != n - 1) { // if $, then it must
                                                        // be at the last
                                                        // position of the
                                                        // expression
                    throw new RuntimeException(
                            "Special char $ should be present only at the last position of the expression (position="
                                    + (i + 1)
                                    + ", char="
                                    + expr[i + 1]
                                    + ")");
                }
                if (nextCh == '$' || nextCh == '*') { // a* or b$
                    current = current.add(expr[i], nextCh);
                    i += 2;
                    continue;
                } else {
                    current = current.add(expr[i], expr[i] == '.' ? expr[i]
                            : null);
                    i++;
                    continue;
                }
            } else { // a or b
                current = current.add(expr[i], null);
                i++;
                continue;
            }
        } else {
            throw new RuntimeException("Invalid char - (position=" + (i)
                    + ", char=" + expr[i] + ")");
        }
    }

    debug("preprocess - end");
    return root;
}

// Traverse over the terms in the list and iterate and match the input string
// The terms list is the FSM (Finite State Machine); the end of list indicates
// end state. That is, input is valid and matching the expression
//
// Timing : O(n) for pre-processing + O(n) for processing = 2O(n) = ~O(n) where n -> expression length
// Timing : O(2n) ~ O(n)
// Space :  O(n)    where n -> expression length decides the no.of terms stored in the list
public boolean process(String expression, String testString) {
    Node root = preprocess(expression);
    print(root);
    Node current = root.next();
    if (root == null || current == null)
        return false;
    int i = 0;
    int n = testString.length();
    debug("input-string-length=" + n);
    char[] test = testString.toCharArray();
    // while (i < n && current != null) {
    while (current != null) {
        debug("process: i=" + i);
        debug("process: ch=" + current.ch + ", sch=" + current.sch);
        if (current.sch == null) { // no special char just [a-z] case
            if (i >= n || test[i] != current.ch) { // input exhausted, or test char and
                                            // current state char do not match
                return false;
            } else {
                i++;
                current = current.next();
                continue;
            }
        } else if (current.sch == '^') { // process start char
            if (i == 0 && n > 0 && test[i] == current.ch) {
                i++;
                current = current.next();
                continue;
            } else {
                return false;
            }

        } else if (current.sch == '$') { // process end char
            if (i == n - 1 && test[i] == current.ch) {
                i++;
                current = current.next();
                continue;
            } else {
                return false;
            }

        } else if (current.sch == '*') { // process repeat char
            if (letters(current.ch)) { // like a* or b*
                while (i < n && test[i] == current.ch)
                    i++; // move i till end of repeat char
                current = current.next();
                continue;
            } else if (current.ch == '.') { // like .*
                Node nextNode = current.next();
                print(nextNode);
                if (nextNode == null) {
                    // .* is the last term, so it absorbs the rest of the input
                    i = n;
                    current = null;
                    continue;
                }
                if (i >= n || test[i] == nextNode.ch) {
                    // a.*z = az: .* matches nothing and the next state in the
                    // list consumes test[i] (or rejects an exhausted input)
                    current = nextNode;
                    continue;
                } else {
                    // a.*z = abz or
                    // a.*z = abbz
                    char tch = test[i]; // get 'b'
                    while (i + 1 < n && test[++i] == tch)
                        ; // move i till end of repeat char
                    current = nextNode;
                    continue;
                }
            } else { // like $* or ^*
                debug("process: return false-1");
                return false;
            }

        } else if (current.sch == '.') { // process any char
            if (i >= n || !letters(test[i])) {
                return false;
            }
            i++;
            current = current.next();
            continue;
        }
    }

    if (i == n && current == null) {
        // string position is out of bound
        // list is at end ie. exhausted both expression and input
        // FSM reached the end state, hence the input is valid and matches the given expression 
        return true;
    } else {
        return false;
    }
}

public void debug(Object str) {
    boolean debug = false;
    if (debug) {
        System.out.println("[debug] " + str);
    }
}

private void print(Node node) {
    StringBuilder sb = new StringBuilder();
    while (node != null) {
        sb.append(node + " ");
        node = node.next();
    }
    sb.append("\n");
    debug(sb.toString());
}

public boolean match(String expr, String input) {
    boolean result = process(expr, input);
    System.out.printf("\n%-20s %-20s %-20s\n", expr, input, result);
    return result;
}

public void runMatchTests() {
    match("ab", "ab");
    match("a*b", "aaaaaab");
    match("a*b*c*", "abc");
    match("a*b*c", "aaabccc");
    match("^abc*b", "abccccb");
    match("^abc*b", "abccccbb");
    match("^abcd$", "abcd");
    match("^abc*abc$", "abcabc");
    match("^abc.abc$", "abczabc");
    match("^ab..*abc$", "abyxxxxabc");
    match("a*b*", ""); // handles empty input string
    match("xyza*b*", "xyz");
}}

Answer 3: (score: 0)

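#include <stdio.h>

/* Returns 1 if test matches reg, 0 otherwise. Single greedy pass, no backtracking. */
int regex_validate(const char *reg, const char *test) {
        const char *ptr = reg;

        while (*test) {
                switch (*ptr) {
                        case '.':
                                /* '.' matches any single character */
                                test++; ptr++;
                                continue;
                        case '*':
                                if (ptr == reg)
                                        return 0; /* '*' has nothing to repeat */
                                if (*(ptr-1) == *test) {
                                        test++; continue;
                                }
                                else if (*(ptr-1) == '.' && (*test == *(test-1))) {
                                        test++; continue;
                                }
                                else {
                                        ptr++; continue;
                                }
                        case '^':
                                ptr++;
                                /* consume the literal prefix that follows '^' */
                                while (*ptr && *test && *ptr == *test) {
                                        ptr++; test++;
                                }
                                if (!*ptr && !*test)
                                        return 1;
                                if (*ptr == '$' || *ptr == '*' || *ptr == '.')
                                        continue;
                                return 0;
                        case '$':
                                /* the loop only runs while input remains, so '$' fails here */
                                return 0;
                        default:
                                if (*ptr != *test)
                                        return 0;
                                test++; ptr++;
                                continue;
                }
        }
        /* input consumed: the pattern must be finished too,
           apart from a trailing '*' and/or '$' */
        if (*ptr == '*' && ptr > reg)
                ptr++;
        if (*ptr == '$')
                ptr++;
        return *ptr == '\0';
}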

int main () {
        printf("regex=%d\n", regex_validate("ab", "ab"));
        printf("regex=%d\n", regex_validate("a*b", "aaaaaab"));
        printf("regex=%d\n", regex_validate("^abc.abc$", "abcdabc"));
        printf("regex=%d\n", regex_validate("^abc*abc$", "abcabc"));
        printf("regex=%d\n", regex_validate("^abc*b", "abccccb"));
        printf("regex=%d\n", regex_validate("^abc*b", "abbccccb"));
        return 0;
}