Expressing regular expressions as a context-free grammar

Time: 2016-01-07 01:16:37

Tags: regex parsing syntax language-agnostic ocaml

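I am writing a parser for a simple regular expression engine.

The engine supports the letters a .. z, alternation (|), and closure (*), as well as concatenation and parentheses.

Here is the CFG I came up with: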

 exp = concat factor1
 factor1 = "|" exp | e
 concat = term factor2
 factor2 = concat | e
 term = element factor3
 factor3 = * | e
 element = (exp) | a .. z

which is equivalent to

 S = T X
 X = "|" S | E
 T = F Y 
 Y = T | E
 F = U Z
 Z = *| E
 U = (S) | a .. z

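For alternation and closure, I can handle them easily by looking ahead and choosing a production based on the next token. However, I see no way to handle concatenation by lookahead, because it is implicit: no token marks it.

How should I handle concatenation, or what is wrong with my grammar?

Here is the OCaml code I use for parsing: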

type regex = 
  | Closure of regex
  | Char of char
  | Concatenation of regex * regex
  | Alternation of regex * regex
  (*| Epsilon*)


exception IllegalExpression of string

type token = 
  | End
  | Alphabet of char
  | Star
  | LParen
  | RParen
  | Pipe

let rec parse_S (l : token list) : (regex * token list) = 
  let (a1, l1) = parse_T l in
  let (t, rest) = lookahead l1 in 
  match t with
  | Pipe ->                                   
      let (a2, l2) = parse_S rest in
      (Alternation (a1, a2), l2)
  | _ -> (a1, l1)                             

and parse_T (l : token list) : (regex * token list) = 
  let (a1, l1) = parse_F l in
  let (t, rest) = lookahead l1 in 
  match t with
  | Alphabet c -> (Concatenation (a1, Char c), rest)
  | LParen -> 
     (let (a, l1) = parse_S rest in
      let (t1, l2) = lookahead l1 in
      match t1 with
      | RParen -> (Concatenation (a1, a), l2)
      | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ -> 
      let (a2, rest) = parse_T l1 in
      (Concatenation (a1, a2), rest)


and parse_F (l : token list) : (regex * token list) = 
  let (a1, l1) = parse_U l in 
  let (t, rest) = lookahead l1 in 
  match t with
  | Star -> (Closure a1, rest)
  | _ -> (a1, l1)

and parse_U (l : token list) : (regex * token list) = 
  let (t, rest) = lookahead l in
  match t with
  | Alphabet c -> (Char c, rest)
  | LParen -> 
     (let (a, l1) = parse_S rest in
      let (t1, l2) = lookahead l1 in
      match t1 with
      | RParen -> (a, l2)
      | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ -> raise (IllegalExpression "Unknown token")

1 answer:

Answer 0 (score: 0)

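For an LL grammar, the FIRST set of a rule is the set of tokens that are allowed as the first token of that rule. The sets can be constructed iteratively until a fixed point is reached.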

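  1. A rule that begins with a token has that token in its FIRST set.
  2. A rule that begins with a nonterminal has that nonterminal's FIRST set in its own FIRST set.
  3. A rule T = A | B has the union of FIRST(A) and FIRST(B) as its FIRST set.
  4. Start with step 1, then repeat steps 2 and 3 until the FIRST sets reach a fixed point (they stop changing). You then have the actual FIRST sets of the grammar and can use lookahead to decide on each rule.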
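
As a rough illustration of the steps above, the following sketch computes FIRST sets for the S/X/T/Y/F/Z/U grammar from the question by iterating to a fixed point. The names `symbol`, `grammar`, `first_of_seq`, `step`, and `fix` are made up for this example, the terminal "a" stands in for the whole class a .. z, and the marker "eps" is an assumed way of tracking epsilon, which the steps leave implicit.

```ocaml
(* A grammar symbol is either a terminal token or a nonterminal name. *)
type symbol = Tok of string | Nt of string

module SS = Set.Make (String)

(* Assumed marker for epsilon; it must not clash with a real terminal. *)
let eps = "eps"

(* The grammar from the question. An empty alternative means epsilon,
   and "a" stands in for any letter a .. z. *)
let grammar = [
  "S", [ [Nt "T"; Nt "X"] ];
  "X", [ [Tok "|"; Nt "S"]; [] ];
  "T", [ [Nt "F"; Nt "Y"] ];
  "Y", [ [Nt "T"]; [] ];
  "F", [ [Nt "U"; Nt "Z"] ];
  "Z", [ [Tok "*"]; [] ];
  "U", [ [Tok "("; Nt "S"; Tok ")"]; [Tok "a"] ];
]

let lookup tbl n = try List.assoc n tbl with Not_found -> SS.empty

(* FIRST of a sequence of symbols, using the current estimates in tbl. *)
let rec first_of_seq tbl = function
  | [] -> SS.singleton eps
  | Tok t :: _ -> SS.singleton t                 (* step 1 *)
  | Nt n :: rest ->                              (* step 2 *)
      let f = lookup tbl n in
      if SS.mem eps f
      then SS.union (SS.remove eps f) (first_of_seq tbl rest)
      else f

(* One pass over all rules: union the FIRST sets of the alternatives (step 3). *)
let step tbl =
  List.map (fun (n, alts) ->
      let add acc alt = SS.union acc (first_of_seq tbl alt) in
      (n, List.fold_left add (lookup tbl n) alts))
    grammar

(* Step 4: repeat until no set changes. *)
let rec fix tbl =
  let tbl' = step tbl in
  if List.for_all2 (fun (_, a) (_, b) -> SS.equal a b) tbl tbl'
  then tbl
  else fix tbl'

let first_sets = fix (List.map (fun (n, _) -> (n, SS.empty)) grammar)
```

With this encoding, `SS.elements (List.assoc "T" first_sets)` lists the tokens that may start a T, and an entry containing "eps" means the rule may match nothing, in which case the parser should fall through when the lookahead is not one of the other tokens.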

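    Note: in your code, the parse_T function does not match the FIRST(T) set. Take 'a|b' as an example: parse_T is entered and 'a' is matched by the parse_F call. The lookahead is then '|', which selects the epsilon alternative in your grammar but not in your code: the catch-all branch of parse_T calls parse_T again, and that call ends in parse_U raising "Unknown token".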
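
One possible way to bring parse_T in line with FIRST(T) is sketched below; it is not necessarily the only fix. The `lookahead` helper is not shown in the question, so the definition here is an assumption (it returns the next token and the remaining list, and End on empty input). Type annotations are dropped for brevity, and the parser does not check that all input was consumed.

```ocaml
type regex =
  | Closure of regex
  | Char of char
  | Concatenation of regex * regex
  | Alternation of regex * regex

exception IllegalExpression of string

type token = End | Alphabet of char | Star | LParen | RParen | Pipe

(* Assumed helper: next token plus the rest of the input. *)
let lookahead = function
  | [] -> (End, [])
  | t :: rest -> (t, rest)

(* S = T X,  X = "|" S | epsilon *)
let rec parse_S l =
  let (a1, l1) = parse_T l in
  match lookahead l1 with
  | (Pipe, rest) ->
      let (a2, l2) = parse_S rest in
      (Alternation (a1, a2), l2)
  | _ -> (a1, l1)

(* T = F Y,  Y = T | epsilon.
   FIRST(T) is a letter or "(", so only those tokens continue the
   concatenation; any other lookahead selects the epsilon alternative. *)
and parse_T l =
  let (a1, l1) = parse_F l in
  match lookahead l1 with
  | (Alphabet _, _) | (LParen, _) ->
      (* Pass l1, not the rest: the recursive call consumes the token. *)
      let (a2, l2) = parse_T l1 in
      (Concatenation (a1, a2), l2)
  | _ -> (a1, l1)

(* F = U Z,  Z = "*" | epsilon *)
and parse_F l =
  let (a1, l1) = parse_U l in
  match lookahead l1 with
  | (Star, rest) -> (Closure a1, rest)
  | _ -> (a1, l1)

(* U = "(" S ")" | a .. z *)
and parse_U l =
  match lookahead l with
  | (Alphabet c, rest) -> (Char c, rest)
  | (LParen, rest) ->
      let (a, l1) = parse_S rest in
      (match lookahead l1 with
       | (RParen, l2) -> (a, l2)
       | _ -> raise (IllegalExpression "Unbalanced parentheses"))
  | _ -> raise (IllegalExpression "Unknown token")
```

For example, `parse_S [Alphabet 'a'; Pipe; Alphabet 'b']` now returns the alternation of the two characters instead of raising, because '|' is not in FIRST(T) and parse_T returns without recursing. Implicit concatenation is handled the same way: the parser continues a T exactly when the lookahead could start another T.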