在解析HTSQL时处理语法歧义

时间:2016-05-22 09:52:54

标签: python parsing common-lisp grammar htsql

我正在编写一个语法来解析HTSQL语法,并坚持如何处理段和除法运算符的/字符的重用。 described grammar并不是非常正式的,所以我一直在关注Python实现的确切输出,从粗略的一瞥似乎是一个手写的解析器,而不是使用解析器生成器 - 作为参考解析器生成器我目前使用的是CL-YACC CL-LEX。 (FWIW完整的事情is here,虽然可能有点过时。)

由于"/1"被解析为'(:COLLECT (:INTEGER "1"))',但由于"/1/2"被解析为'(:COLLECT (:OPERATOR / (:INTEGER "1") (:INTEGER "2")))'而出现了一个我正在努力解决的歧义,即一个是段分隔符,另一个是分裂; "/1//2"再次为'(:COLLECT (:OPERATOR / (:INTEGER "1") (:COLLECT (:INTEGER "2"))))'

问题是,如何在不借助切换到手动解析器的情况下在语法规范中处理这个问题?切换到不同的解析器生成器类(而不是LALR(1))会有帮助吗?

到目前为止,我已经尝试了不同的语法变体,但是对于整个语法来说,优先级是固定的这一事实也会干扰斜杠的两种解释。我尝试的另一种方法是在词法分析器中消除歧义,即以不同方式处理第一个斜杠(在每个“组”中)并返回不同的符号,例如, DIV - 然而,我找不到一个好的规则,并以某种方式怀疑它纯粹是通过查看词汇结构而存在。

最后,我很想通过完全脱离给定的解析器来解决这个问题,以使我的生活更轻松。从某种意义上说,更容易理解可预测的语法,你会认为这更令人满意吗?

图1,用于检查解析树的Python脚本:

import htsql


application = htsql.HTSQL("sqlite:///htsql_demo.sqlite")


global y
y = None


def p(string):
    global y
    with application:
        y = htsql.core.syn.parse.parse(string)
        return y


def l(name):
    result = []
    for c in name:
        if c.isupper() and result:
            result.append("-")
        result.append(c)
    return "".join(result)


def keyword(name):
    return ":{}".format(name.upper())


def n(expression):
    name = expression.__class__.__name__
    name = name[:name.find("Syntax")]
    return keyword(l(name))


def t(expression):
    arguments = [n(expression)]
    d = expression.__dict__
    if "identifier" in d:
        arguments.append(t(expression.identifier))
    if "text" in d:
        arguments.append("\"{}\"".format(expression.text))
    if "symbol" in d:
        if not isinstance(expression, (htsql.core.syn.syntax.ProjectSyntax, htsql.core.syn.syntax.FilterSyntax, htsql.core.syn.syntax.CollectSyntax, htsql.core.syn.syntax.DetachSyntax)):
            arguments.append(expression.symbol)
    if "arm" in d:
        arguments.append(t(expression.arm))
    if "larm" in d:
        arguments.append(t(expression.larm))
    if "rarm" in d:
        arguments.append(t(expression.rarm))
    if "arms" in d:
        arguments.extend(t(x) for x in expression.arms)
    if "rarms" in d:
        arguments.extend(t(x) for x in expression.rarms)
    return "({})".format(" ".join(arguments))


# t(p("/school"))
# '(:COLLECT (:IDENTIFIER "school"))

# t(p("/'school'"))
# '(:COLLECT (:STRING "school"))

图2,我当前的解析器,没有正确处理这个问题:

(defpackage #:cl-htsql
  (:use #:cl #:alexandria #:cl-lex #:yacc)
  (:import-from #:arnesi #:with-collector))

(eval-when (:compile-toplevel :load-toplevel :execute)
  (defun maybe-intern (name &optional (package NIL package-p))
    "If NAME is a SYMBOL, return it, otherwise INTERN it."
    (cond
      ((symbolp name)
       name)
      (package-p
       (intern name package))
      (T
       (intern name))))

  (defmacro define-lexer (name &body patterns)
    "Shortcut for DEFINE-STRING-LEXER."
    `(define-string-lexer ,name
       ,@(mapcar
          (lambda (pattern)
            (etypecase pattern
              ((or symbol string)
               (let ((symbol (maybe-intern pattern))
                     (pattern (string pattern)))
                 `(,pattern (return (values ',symbol ',symbol)))))
              (list
               (destructuring-bind (pattern &optional symbol value) pattern
                 (let* ((symbol (or symbol (intern pattern)))
                        (value (or value symbol)))
                   (etypecase symbol
                     (list
                      `(,pattern ,symbol))
                     (symbol
                      `(,pattern (return (values ',symbol ',value))))))))))
          patterns))))

;; parser are results are to be treated immutable
(define-lexer string-lexer
  /
  ("\\|" \|)
  ("\\&" &)
  <=
  >=
  ==
  =
  !==
  !=
  !~
  !
  ~
  <
  >
  @
  ("\\?" ?)
  ("\\." \.)
  ("\\(" \()
  ("\\)" \))
  ("\\+" +)
  -
  ("\\*" *)
  \:
  ("-?0|[1-9][0-9]*(\\.[0-9]*)?([eE][+-]?[0-9]+)?"
   (return (cond
             ((find #\e $@)
              (values 'float $@))
             ((find #\. $@)
              (values 'decimal $@))
             (T
              (values 'integer $@)))))
  ("([^\"\\.\\?~\'=<>\\(\\)@\\|\\&/:])+" (return (values 'name $@)))
  ("\'([^\\\']|\\.)*?\'" (return (values 'string (string-trim "\'" $@))))
  ("\"([^\\\"]|\\.)*?\"" (return (values 'string (string-trim "\"" $@)))))

(define-parser *expression-parser*
  (:muffle-conflicts (44 0))
  (:start-symbol query)
  (:terminals (|\|| #+(or)div & ! |.| ? / = != !== !~ ~ < > == <= >= \( \) + - * @ name integer decimal float string))
  (:precedence ((:left @) (:left ~) (:left |.|) (:left + -) (:left * div) (:left = != == !== ~ !~ < <= > >=) (:left !) (:left &) (:left |\||) (:left ?) (:left /)))

  (query
   segment)

  (segment
   (/ segment (lambda (x y) (declare (ignore x)) (if (eq y :skip) '(:skip) `(:collect ,y))))
   skip
   group)

  (skip
   ((constantly :skip)))

  (group
   (\( segment \) (lambda (x y z) (declare (ignore x z)) `(:group ,y)))
   sieve)

  (sieve
   (segment ? segment (lambda (x y z) (declare (ignore y)) `(:filter ,x ,z)))
   or)

  (or
   (segment |\|| segment (lambda (x y z) `(:operator ,y ,x ,z)))
   and)

  (and
   (segment & segment (lambda (x y z) `(:operator ,y ,x ,z)))
   not)

  (not
   (! segment (lambda (x y) `(:prefix ,x ,y)))
   comparison)

  (comparison
   (segment = segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment != segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment ~ segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment !~ segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment == segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment !== segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment < segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment <= segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment > segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment >= segment (lambda (x y z) `(:operator ,y ,x ,z)))
   addition)

  (addition
   (segment + segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment - segment (lambda (x y z) `(:operator ,y ,x ,z)))
   multiplication)

  (multiplication
   (segment * segment (lambda (x y z) `(:operator ,y ,x ,z)))
   (segment / segment (lambda (x y z) (declare (ignore y)) `(:operator / ,x ,z)))
   composition)

  (composition
   (segment |.| segment (lambda (x y z) (declare (ignore y)) `(:compose ,x ,z)))
   attach)

  (attach
   (segment @ segment (lambda (x y z) (declare (ignore y)) `(:attach ,x ,z)))
   detach)

  (detach
   (@ segment (lambda (x y) (declare (ignore x)) `(:detach ,y)))
   term)

  (term
   (name (lambda (x) `(:identifier ,x)))
   (string (lambda (x) `(:string ,x)))
   (number (lambda (x) `(:integer ,x)))
   (integer (lambda (x) `(:integer ,x)))
   (decimal (lambda (x) `(:decimal ,x)))
   (float (lambda (x) `(:float ,x)))))

(defun make-lexer-for-source (source)
  "Make a lexer for the SOURCE, either a STRING or a STREAM."
  (etypecase source
    (string (string-lexer source))
    (stream
     (flet ((ignore (c)
              (declare (ignore c))))
       (stream-lexer #'read-line #'string-lexer #'ignore #'ignore)))))

(defun lex-source (source)
  "Debug helper to lex a SOURCE into a list of tokens."
  (let ((lexer (make-lexer-for-source source)))
    (loop
      for (x y) = (multiple-value-list (funcall lexer))
      while x
      collect (list x y))))

(define-condition htsql-parse-error (simple-error) ())

(defun translate-yacc-error (error)
  (make-condition
   'htsql-parse-error
   :format-control "Couldn't parse HTSQL query: ~A."
   :format-arguments (list error)))

(defun parse-htsql-query (source)
  "Parse SOURCE into a syntax tree.  The SOURCE may be either a STRING or
a STREAM."
  (handler-case
      (parse-with-lexer
       (make-lexer-for-source source)
       *expression-parser*)
    (yacc-parse-error (error)
      (error (translate-yacc-error error)))))

;; > (parse-htsql-query "/1/")
;; (:OPERATOR / (:COLLECT (:INTEGER "1")) :SKIP)
;; > (parse-htsql-query "/1/2")
;; (:OPERATOR / (:COLLECT (:INTEGER "1")) (:INTEGER "2"))

1 个答案:

答案 0 :(得分:3)

如果查看运算符列表,您将看到另一种情况,其中相同的符号用作二进制和一元运算符,具有不同的优先级:一元减号。这在表达式语言中非常正常,yacc和大多数yacc派生数提供了一种解决方案:生产的显式优先级声明。在yacc中,您可以写

%token UNARY_MINUS UNARY_SLASH

%left UNARY_SLASH
%left '-' '+'
%left '/' '*'
%left UNARY_MINUS
%%
expr: '/' expr %prec UNARY_SLASH
    | expr '*' expr
    | expr '/' expr
    | expr '+' expr
    | expr '-' expr
    | '-' expr %prec UNARY_MINUS

我不知道CL-YACC是否提供等效物品;我没有在文档中找到任何内容。

您没有义务使用优先声明,我甚至不相信它们是个好主意。用一种明确的方式说出你的意思只是稍微复杂一点:

term: ID | NUMBER | ... | '(' expr0 ')'
expr0: '/' expr0
     | expr1
expr1: expr1 '+' expr2
     | expr1 '-' expr2
     | expr2
expr2: expr2 '/' expr3
     | expr2 '*' expr3
     | expr3
expr3: '-' expr3
     | term

以上是明确的,根本不需要优先声明。

(非正式)HTSQL语法说“段”运算符的语法是/ T而不是/ expr,但我没有看到T可能是什么的任何指示。在我看来,T是一个术语,也可能是一个lexpr(T := expr),这使1 / 2成为不太可能的候选人。但是,正如你所说,这是非正式的。