我正在寻找单独的sql语句,这些语句可以包含注释/ * * /或字符串'test'或行注释--line注释(sql样式),用分号分隔。一个例子是:
Blah blah 'string ; ' ;
More text /* semicolon(;) inside comment */
Some more text
in multiple lines
上面的文本应该只检索两个语句,因为字符串''中的分号和注释/ * * /不应该算作分隔符。 我目前的语法是:
grammar SqlStatements;
sql_stmts:
text (';' text)* EOF
;
text:
SINGLE_LINE_COMMENT*
| STRING*
| TEXT*
;
TEXT:
~['--';\''/*']*
;
STRING
:
'\'' ('\'\''|~'\'')* '\''
;
SINGLE_LINE_COMMENT
: '--' ~[\r\n]*
;
MULTILINE_COMMENT
: '/*' .*? ( '*/' | EOF )
;
键入*时,上面的代码崩溃了。
答案 0 :(得分:2)
解析SQL的常用方法是首先拆分单个语句。这可能涉及处理定界符开关,这是需要的。当你在转储中有一个存储过程时,它必须作为一个整体处理,但需要分号作为内部语句分隔符。
这可以通过跳过注释和字符串的优化循环非常快速地完成。以下是MySQL Workbench中处理此问题的代码:
/**
* A statement splitter to take a list of sql statements and split them into individual statements,
* return their position and length in the original string (instead the copied strings).
*/
size_t MySQLParserServicesImpl::determineStatementRanges(const char *sql, size_t length,
const std::string &initial_delimiter,
std::vector<std::pair<size_t, size_t> > &ranges,
const std::string &line_break)
{
_stop = false;
std::string delimiter = initial_delimiter.empty() ? ";" : initial_delimiter;
const unsigned char *delimiter_head = (unsigned char*)delimiter.c_str();
const unsigned char keyword[] = "delimiter";
const unsigned char *head = (unsigned char *)sql;
const unsigned char *tail = head;
const unsigned char *end = head + length;
const unsigned char *new_line = (unsigned char*)line_break.c_str();
bool have_content = false; // Set when anything else but comments were found for the current statement.
while (!_stop && tail < end)
{
switch (*tail)
{
case '/': // Possible multi line comment or hidden (conditional) command.
if (*(tail + 1) == '*')
{
tail += 2;
bool is_hidden_command = (*tail == '!');
while (true)
{
while (tail < end && *tail != '*')
tail++;
if (tail == end) // Unfinished comment.
break;
else
{
if (*++tail == '/')
{
tail++; // Skip the slash too.
break;
}
}
}
if (!is_hidden_command && !have_content)
head = tail; // Skip over the comment.
}
else
tail++;
break;
case '-': // Possible single line comment.
{
const unsigned char *end_char = tail + 2;
if (*(tail + 1) == '-' && (*end_char == ' ' || *end_char == '\t' || is_line_break(end_char, new_line)))
{
// Skip everything until the end of the line.
tail += 2;
while (tail < end && !is_line_break(tail, new_line))
tail++;
if (!have_content)
head = tail;
}
else
tail++;
break;
}
case '#': // MySQL single line comment.
while (tail < end && !is_line_break(tail, new_line))
tail++;
if (!have_content)
head = tail;
break;
case '"':
case '\'':
case '`': // Quoted string/id. Skip this in a local loop.
{
have_content = true;
char quote = *tail++;
while (tail < end && *tail != quote)
{
// Skip any escaped character too.
if (*tail == '\\')
tail++;
tail++;
}
if (*tail == quote)
tail++; // Skip trailing quote char to if one was there.
break;
}
case 'd':
case 'D':
{
have_content = true;
// Possible start of the keyword DELIMITER. Must be at the start of the text or a character,
// which is not part of a regular MySQL identifier (0-9, A-Z, a-z, _, $, \u0080-\uffff).
unsigned char previous = tail > (unsigned char *)sql ? *(tail - 1) : 0;
bool is_identifier_char = previous >= 0x80
|| (previous >= '0' && previous <= '9')
|| ((previous | 0x20) >= 'a' && (previous | 0x20) <= 'z')
|| previous == '$'
|| previous == '_';
if (tail == (unsigned char *)sql || !is_identifier_char)
{
const unsigned char *run = tail + 1;
const unsigned char *kw = keyword + 1;
int count = 9;
while (count-- > 1 && (*run++ | 0x20) == *kw++)
;
if (count == 0 && *run == ' ')
{
// Delimiter keyword found. Get the new delimiter (everything until the end of the line).
tail = run++;
while (run < end && !is_line_break(run, new_line))
run++;
delimiter = base::trim(std::string((char *)tail, run - tail));
delimiter_head = (unsigned char*)delimiter.c_str();
// Skip over the delimiter statement and any following line breaks.
while (is_line_break(run, new_line))
run++;
tail = run;
head = tail;
}
else
tail++;
}
else
tail++;
break;
}
default:
if (*tail > ' ')
have_content = true;
tail++;
break;
}
if (*tail == *delimiter_head)
{
// Found possible start of the delimiter. Check if it really is.
size_t count = delimiter.size();
if (count == 1)
{
// Most common case. Trim the statement and check if it is not empty before adding the range.
head = skip_leading_whitespace(head, tail);
if (head < tail)
ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
head = ++tail;
have_content = false;
}
else
{
const unsigned char *run = tail + 1;
const unsigned char *del = delimiter_head + 1;
while (count-- > 1 && (*run++ == *del++))
;
if (count == 0)
{
// Multi char delimiter is complete. Tail still points to the start of the delimiter.
// Run points to the first character after the delimiter.
head = skip_leading_whitespace(head, tail);
if (head < tail)
ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
tail = run;
head = run;
have_content = false;
}
}
}
}
// Add remaining text to the range list.
head = skip_leading_whitespace(head, tail);
if (head < tail)
ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
return 0;
}
这也适用于大型sql脚本,并且可以在大约1秒内拆分包含100万行的转储(当然取决于你运行它的盒子)。 var _stop是一个用于允许破坏拆分过程的标志。代码正在处理MySQL代码,因此它正确处理隐藏的命令(版本注释)。
使用每个查询的开始和长度信息,您现在可以转到解析器。
答案 1 :(得分:1)
即使Mike的答案很好,我也需要在antlr中创建语法。以下语法对我有用:
sql_stmts:
sql_stmt (';'+ sql_stmt)*
;
sql_stmt:
TEXT*
;
TEXT:
~[']
| STRING
;
BLOCK_COMMENT
: '/*' .*? ( '*/' | EOF ) -> channel(HIDDEN)
;
LINE_COMMENT
: '--' ~[\r\n]* -> channel(HIDDEN)
;
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
STRING
:
'\'' ('\'\''|~'\'')* '\''
;
答案 2 :(得分:0)
首先,不要忽略编译语法时产生的警告和错误消息。
其次,TEXT规则没有按照您的想法执行 - 引用不在那里工作。请参阅doc。
第三,你的第一行输入实际上是TEXT STRING TEXT SEMI。第二个TEXT是SEMI规则之前的空格,但是你的规则只允许在SEMI之前单个非连续出现TEXT。