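Grammar for parsing SQL statements separated by semicolons in ANTLR4

Date: 2015-01-27 16:27:13

Tags: grammar antlr4

I want to split text into individual SQL statements that are separated by semicolons. The statements can contain block comments /* */, strings such as 'test', and SQL-style line comments (-- line comment). An example: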

Blah blah 'string ; ' ;
More text /* semicolon(;) inside comment */
Some more text 
in multiple lines

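The text above should yield only two statements, because the semicolons inside the string '' and inside the comment /* */ must not count as separators. My current grammar is: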

grammar SqlStatements;

sql_stmts:
    text (';' text)* EOF
    ;

text:
     SINGLE_LINE_COMMENT*
    | STRING* 
    | TEXT* 
;

TEXT:
 ~['--';\''/*']* 
;

STRING
: 
     '\'' ('\'\''|~'\'')* '\''

;

SINGLE_LINE_COMMENT
 : '--' ~[\r\n]* 
 ;

MULTILINE_COMMENT
 : '/*' .*? ( '*/' | EOF ) 
 ;

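The grammar above crashes when the input contains a *.

3 Answers:

Answer 0: (score: 2)

The common approach for parsing SQL is to split the input into individual statements first. That may involve handling delimiter switches, which are needed, for example, when a dump contains a stored procedure: the procedure must be treated as a single statement, yet it uses the semicolon as the delimiter for its internal statements.

The splitting can be done very quickly with an optimized loop that skips comments and strings. Here is the code that handles this in MySQL Workbench: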

/**
 * A statement splitter to take a list of sql statements and split them into individual statements,
 * return their position and length in the original string (instead of the copied strings).
 */
size_t MySQLParserServicesImpl::determineStatementRanges(const char *sql, size_t length,
                                                         const std::string &initial_delimiter,
                                                         std::vector<std::pair<size_t, size_t> > &ranges,
                                                         const std::string &line_break)
{
  _stop = false;
  std::string delimiter = initial_delimiter.empty() ? ";" : initial_delimiter;
  const unsigned char *delimiter_head = (unsigned char*)delimiter.c_str();

  const unsigned char keyword[] = "delimiter";

  const unsigned char *head = (unsigned char *)sql;
  const unsigned char *tail = head;
  const unsigned char *end = head + length;
  const unsigned char *new_line = (unsigned char*)line_break.c_str();
  bool have_content = false; // Set when anything else but comments were found for the current statement.

  while (!_stop && tail < end)
  {
    switch (*tail)
    {
      case '/': // Possible multi line comment or hidden (conditional) command.
        if (*(tail + 1) == '*')
        {
          tail += 2;
          bool is_hidden_command = (*tail == '!');
          while (true)
          {
            while (tail < end && *tail != '*')
              tail++;
            if (tail == end) // Unfinished comment.
              break;
            else
            {
              if (*++tail == '/')
              {
                tail++; // Skip the slash too.
                break;
              }
            }
          }

          if (!is_hidden_command && !have_content)
            head = tail; // Skip over the comment.
        }
        else
          tail++;

        break;

      case '-': // Possible single line comment.
      {
        const unsigned char *end_char = tail + 2;
        if (*(tail + 1) == '-' && (*end_char == ' ' || *end_char == '\t' || is_line_break(end_char, new_line)))
        {
          // Skip everything until the end of the line.
          tail += 2;
          while (tail < end && !is_line_break(tail, new_line))
            tail++;
          if (!have_content)
            head = tail;
        }
        else
          tail++;

        break;
      }

      case '#': // MySQL single line comment.
        while (tail < end && !is_line_break(tail, new_line))
          tail++;
        if (!have_content)
          head = tail;
        break;

      case '"':
      case '\'':
      case '`': // Quoted string/id. Skip this in a local loop.
      {
        have_content = true;
        char quote = *tail++;
        while (tail < end && *tail != quote)
        {
          // Skip any escaped character too.
          if (*tail == '\\')
            tail++;
          tail++;
        }
        if (*tail == quote)
          tail++; // Skip the trailing quote char too, if one was there.

        break;
      }

      case 'd':
      case 'D':
      {
        have_content = true;

        // Possible start of the keyword DELIMITER. Must be at the start of the text or a character,
        // which is not part of a regular MySQL identifier (0-9, A-Z, a-z, _, $, \u0080-\uffff).
        unsigned char previous = tail > (unsigned char *)sql ? *(tail - 1) : 0;
        bool is_identifier_char = previous >= 0x80
        || (previous >= '0' && previous <= '9')
        || ((previous | 0x20) >= 'a' && (previous | 0x20) <= 'z')
        || previous == '$'
        || previous == '_';
        if (tail == (unsigned char *)sql || !is_identifier_char)
        {
          const unsigned char *run = tail + 1;
          const unsigned char *kw = keyword + 1;
          int count = 9;
          while (count-- > 1 && (*run++ | 0x20) == *kw++)
            ;
          if (count == 0 && *run == ' ')
          {
            // Delimiter keyword found. Get the new delimiter (everything until the end of the line).
            tail = run++;
            while (run < end && !is_line_break(run, new_line))
              run++;
            delimiter = base::trim(std::string((char *)tail, run - tail));
            delimiter_head = (unsigned char*)delimiter.c_str();

            // Skip over the delimiter statement and any following line breaks.
            while (is_line_break(run, new_line))
              run++;
            tail = run;
            head = tail;
          }
          else
            tail++;
        }
        else
          tail++;

        break;
      }

      default:
        if (*tail > ' ')
          have_content = true;
        tail++;
        break;
    }

    if (*tail == *delimiter_head)
    {
      // Found possible start of the delimiter. Check if it really is.
      size_t count = delimiter.size();
      if (count == 1)
      {
        // Most common case. Trim the statement and check if it is not empty before adding the range.
        head = skip_leading_whitespace(head, tail);
        if (head < tail)
          ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
        head = ++tail;
        have_content = false;
      }
      else
      {
        const unsigned char *run = tail + 1;
        const unsigned char *del = delimiter_head + 1;
        while (count-- > 1 && (*run++ == *del++))
          ;

        if (count == 0)
        {
          // Multi char delimiter is complete. Tail still points to the start of the delimiter.
          // Run points to the first character after the delimiter.
          head = skip_leading_whitespace(head, tail);
          if (head < tail)
            ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
          tail = run;
          head = run;
          have_content = false;
        }
      }
    }
  }

  // Add remaining text to the range list.
  head = skip_leading_whitespace(head, tail);
  if (head < tail)
    ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));

  return 0;
}
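
The listing calls two helpers, is_line_break and skip_leading_whitespace, whose definitions are not shown. A minimal sketch of what they might look like follows. The bodies are assumptions inferred from how the listing calls them, not the MySQL Workbench implementation, and is_line_break assumes the text is NUL-terminated.

```cpp
// Hypothetical helper: returns true if the text at `head` starts with the
// NUL-terminated line break sequence `line_break` (for example "\n" or "\r\n").
// Assumes the text itself is NUL-terminated, so a mismatch is always found
// before reading past its end.
static bool is_line_break(const unsigned char *head, const unsigned char *line_break)
{
  if (*line_break == '\0')
    return false; // An empty sequence never matches.

  while (*line_break != '\0' && *head == *line_break)
  {
    ++head;
    ++line_break;
  }
  return *line_break == '\0';
}

// Hypothetical helper: advances `head` past spaces, tabs, line breaks and
// other control characters, but never beyond `tail`.
static const unsigned char *skip_leading_whitespace(const unsigned char *head,
                                                    const unsigned char *tail)
{
  while (head < tail && *head <= ' ')
    ++head;
  return head;
}
```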

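This also works well for large SQL scripts and can split a dump containing 1 million lines in about 1 second (depending, of course, on the machine you run it on). The variable _stop is a flag that allows the splitting process to be interrupted. The code is written for MySQL, so it correctly handles hidden commands (version comments).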
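
To illustrate the core of the loop without the MySQL-specific parts, here is a reduced sketch. It is not the Workbench code: split_statements is a hypothetical name, ';' is the only delimiter, only single-quoted strings with '' as the escaped quote are recognized, and comments are left inside the statement text instead of being skipped.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns (offset, length) pairs for each non-empty statement in `sql`.
// Simplified sketch: no DELIMITER switching, no backslash escapes,
// no '#' comments, no double-quoted or backtick-quoted text.
std::vector<std::pair<std::size_t, std::size_t>> split_statements(const std::string &sql)
{
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  const std::size_t n = sql.size();
  std::size_t start = 0; // Start of the current statement.
  std::size_t i = 0;

  // Trims whitespace on both sides and records the range if anything is left.
  auto add_range = [&](std::size_t from, std::size_t to) {
    while (from < to && static_cast<unsigned char>(sql[from]) <= ' ')
      ++from;
    while (to > from && static_cast<unsigned char>(sql[to - 1]) <= ' ')
      --to;
    if (from < to)
      ranges.emplace_back(from, to - from);
  };

  while (i < n)
  {
    const char c = sql[i];
    if (c == '\'')
    {
      // Quoted string: skip to the closing quote; '' stays inside the string.
      ++i;
      while (i < n)
      {
        if (sql[i] != '\'')
          ++i;
        else if (i + 1 < n && sql[i + 1] == '\'')
          i += 2; // Escaped quote.
        else
        {
          ++i; // Closing quote.
          break;
        }
      }
    }
    else if (c == '/' && i + 1 < n && sql[i + 1] == '*')
    {
      // Block comment: skip to the closing */ or to the end of the input.
      const std::size_t close = sql.find("*/", i + 2);
      i = (close == std::string::npos) ? n : close + 2;
    }
    else if (c == '-' && i + 1 < n && sql[i + 1] == '-')
    {
      // Line comment: skip to the end of the line.
      const std::size_t eol = sql.find_first_of("\r\n", i + 2);
      i = (eol == std::string::npos) ? n : eol;
    }
    else if (c == ';')
    {
      add_range(start, i);
      start = ++i;
    }
    else
      ++i;
  }

  add_range(start, n); // Remaining text after the last delimiter.
  return ranges;
}
```

Applied to the example from the question, this sketch returns two ranges: one covering Blah blah 'string ; ' and one covering the remaining three lines, including the comment.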

With the start and length of each query, you can now hand the individual statements to your parser.
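
One way to hand the ranges to a parser is sketched below. for_each_statement and the parse callback are hypothetical names: the callback stands in for whatever entry point your parser offers, and the ranges are assumed to be (offset, length) pairs like those the splitter fills in.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical glue code: calls `parse` once per statement, passing a pointer
// into the original buffer plus the statement length, so no statement text
// has to be copied.
void for_each_statement(const std::string &sql,
                        const std::vector<std::pair<std::size_t, std::size_t>> &ranges,
                        const std::function<void(const char *, std::size_t)> &parse)
{
  for (const auto &range : ranges)
  {
    // Ignore ranges that do not fit into the buffer.
    if (range.first > sql.size() || range.second > sql.size() - range.first)
      continue;
    parse(sql.data() + range.first, range.second);
  }
}
```

Because the callback receives a pointer and a length, the statement text is not NUL-terminated; a parser that expects a C string would need a copy, for example via std::string(text, length).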

Answer 1: (score: 1)

Mike's answer is fine, but I needed to solve this with a grammar in ANTLR. The following grammar worked for me:

sql_stmts:
    sql_stmt (';'+ sql_stmt)*
    ;

sql_stmt:
    TEXT*
;
TEXT:
~[']
| STRING
;
BLOCK_COMMENT
: '/*' .*? ( '*/' | EOF ) -> channel(HIDDEN)
;
LINE_COMMENT
: '--' ~[\r\n]* -> channel(HIDDEN)
;
SPACES
 : [ \u000B\t\r\n] -> channel(HIDDEN)
 ;

STRING
: 
     '\'' ('\'\''|~'\'')* '\'' 
;

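Answer 2: (score: 0)

First, don't ignore the warnings and error messages produced when compiling the grammar.

Second, the TEXT rule does not do what you think it does: quoting does not work there. Inside a [...] set, each character (or character range) is an individual member of the set, so the quote marks become set members themselves instead of delimiting '--' or '/*' as sequences. See the doc.

Third, your first line of input is actually TEXT STRING TEXT SEMI. The second TEXT is the space before the SEMI, but your rule only allows a single, non-consecutive occurrence of TEXT before a SEMI.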