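Regex to remove single-line SQL comments (--)

Time: 2012-03-23 16:35:18

Tags: c# .net sql regex vb.net

Question:

Can someone give me a regular expression (C# / VB.NET) that removes single-line comments from a SQL statement?

I mean these comments: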

-- This is a comment

not these:

/* this is a comment */

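because I can already handle the star comments.

I have a small parser that removes those comments when they are at the start of a line, but they can also appear somewhere after code or, worse, inside a SQL string such as 'hello --Test -- World'. Those comments should be removed as well (except, of course, the ones inside a SQL string, if possible).

Surprisingly, I couldn't get a regex for this to work. I would have assumed the star comments were harder, but actually they aren't.

As requested, here is my code for removing /**/-style comments. (To make it ignore SQL-style strings, you have to replace the strings with unique identifiers (I use four concatenated ones), then apply the comment removal, then substitute the strings back.)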

    static string RemoveCstyleComments(string strInput) 
    { 
        string strPattern = @"/[*][\w\d\s]+[*]/"; 
        //strPattern = @"/\*.*?\*/"; // Doesn't work 
        //strPattern = "/\\*.*?\\*/"; // Doesn't work 
        //strPattern = @"/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/ "; // Doesn't work 

        // http://stackoverflow.com/questions/462843/improving-fixing-a-regex-for-c-style-block-comments 
        strPattern = @"/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/";  // Works ! 

        string strOutput = System.Text.RegularExpressions.Regex.Replace(strInput, strPattern, string.Empty, System.Text.RegularExpressions.RegexOptions.Multiline); 
        Console.WriteLine(strOutput); 
        return strOutput; 
    } // End Function RemoveCstyleComments 
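The mask, strip, restore sequence described above could be sketched roughly as follows for `--` comments. This is only an illustration, not the asker's actual code: the class and method names are made up, a GUID-based token stands in for the "unique identifier", and only single-quoted literals (with `''` as an escaped quote) are assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlaceholderDemo
{
    // Hypothetical helper: masks '...' literals, strips -- comments, restores the literals.
    public static string StripLineComments(string sql)
    {
        var saved = new Dictionary<string, string>();

        // 1. Swap every single-quoted literal for a unique token.
        //    Format "N" yields a GUID without dashes, so a token never contains "--".
        string masked = Regex.Replace(sql, @"'(?:''|[^'])*'", m =>
        {
            string token = "S" + Guid.NewGuid().ToString("N");
            saved[token] = m.Value;
            return token;
        });

        // 2. With the literals hidden, any remaining -- is treated as a comment start.
        string stripped = Regex.Replace(masked, @"--[^\r\n]*", string.Empty);

        // 3. Put the original literals back. Tokens that sat inside a removed
        //    comment are simply gone, so Replace does nothing for them.
        foreach (var pair in saved)
            stripped = stripped.Replace(pair.Key, pair.Value);

        return stripped;
    }
}
```

For example, `PlaceholderDemo.StripLineComments("SELECT '--keep' FROM t -- drop")` keeps the literal and drops the trailing comment. One known weakness of this approach: an apostrophe inside a comment (as in `-- Don't`) is seen by step 1 as the start of a literal, so the masking can go wrong on such input.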

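7 Answers:

Answer 0 (score: 6)

I'm going to disappoint all of you. This can't be done with regular expressions. Sure, it's easy to find comments that are not in a string (even the OP could do that); the real problem is comments inside a string. Look-arounds offer a little hope, but that is still not enough. Knowing that there is a preceding quote on the line guarantees nothing. The only thing that guarantees anything is whether the number of preceding quotes is odd or even, and that is something you can't determine with a regular expression. So just go with a non-regex approach.

Edit: Here is the C# code: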

        String sql = "--this is a test\r\nselect stuff where substaff like '--this comment should stay' --this should be removed\r\n";
        char[] quotes = { '\'', '"'};
        int newCommentLiteral, lastCommentLiteral = 0;
        while ((newCommentLiteral = sql.IndexOf("--", lastCommentLiteral)) != -1)
        {
            int countQuotes = sql.Substring(lastCommentLiteral, newCommentLiteral - lastCommentLiteral).Split(quotes).Length - 1;
            if (countQuotes % 2 == 0) //this is a comment, since there's an even number of quotes preceding
            {
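                int eol = sql.IndexOf("\r\n", newCommentLiteral); //search from the comment start, not from index 0
                if (eol == -1)
                    eol = sql.Length; //no more newline, meaning end of the string
                else
                    eol += 2; //the line break is removed together with the comment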
                sql = sql.Remove(newCommentLiteral, eol - newCommentLiteral);
                lastCommentLiteral = newCommentLiteral;
            }
            else //this is within a string, find string ending and moving to it
            {
                int singleQuote = sql.IndexOf("'", newCommentLiteral);
                if (singleQuote == -1)
                    singleQuote = sql.Length;
                int doubleQuote = sql.IndexOf('"', newCommentLiteral);
                if (doubleQuote == -1)
                    doubleQuote = sql.Length;

                lastCommentLiteral = Math.Min(singleQuote, doubleQuote) + 1;

                //instead of finding the end of the string you could simply do += 2 but the program will become slightly slower
            }
        }

        Console.WriteLine(sql);

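What this does: it finds every comment literal (--). For each one, it checks whether it really is a comment by counting the quotes between the current match and the previous one. If that number is even, it is a comment, so it is removed (find the next end of line and remove everything in between). If it is odd, the match is inside a string, so find the end of the string and move past it. This snippet relies on a weird SQL trick: 'this" is treated as a valid string, even though the two quotes differ. If that is not true for your SQL dialect, you should try a completely different approach. I would write a program for that case too, but this one is faster and more straightforward.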
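For dialects where a string must be closed by the same quote character that opened it, one alternative is a single-pass scanner that remembers the opening quote. The following is a minimal sketch under that assumption, not part of the answer above; `QuoteAwareScanner` is a made-up name, and unlike the snippet above it keeps the line break after a removed comment.

```csharp
using System;
using System.Text;

public static class QuoteAwareScanner
{
    // Sketch: copies sql to the output, skipping "--" comments that start outside quotes.
    public static string StripLineComments(string sql)
    {
        var output = new StringBuilder(sql.Length);
        char openQuote = '\0'; // '\0' means "not inside a quoted section"

        for (int i = 0; i < sql.Length; i++)
        {
            char c = sql[i];
            if (openQuote != '\0')
            {
                // Only the same quote character closes the section. A doubled quote ('')
                // closes and immediately reopens it, which leaves the state correct.
                if (c == openQuote) openQuote = '\0';
            }
            else if (c == '\'' || c == '"')
            {
                openQuote = c;
            }
            else if (c == '-' && i + 1 < sql.Length && sql[i + 1] == '-')
            {
                // Skip to the end of the line but keep the line break itself.
                while (i < sql.Length && sql[i] != '\r' && sql[i] != '\n') i++;
                i--; // the loop's i++ then lands on the line break (or ends the loop)
                continue;
            }
            output.Append(c);
        }
        return output.ToString();
    }
}
```

Called on the test string from the answer, this leaves `'--this comment should stay'` in place and removes the two real comments. Block comments are not handled here; a `--` inside `/* ... */` would still be treated as a comment start.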

Answer 1 (score: 3)

For the simple cases, you want something like this:
-{2,}.*

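-{2,} looks for a dash that occurs 2 or more times

.* grabs the rest of the line up to the newline

*However, for the edge cases it seems SinistraD is correct that you can't catch everything, but here is an article on how it can be done in C# with a combination of code and regex.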
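To make the "simple cases" boundary concrete, here is a small sketch that applies the pattern with `Regex.Replace`; the class name `SimplePatternDemo` is only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimplePatternDemo
{
    public static string Strip(string sql)
    {
        // -{2,} : two or more dashes;  .* : everything up to the end of the line
        return Regex.Replace(sql, @"-{2,}.*", string.Empty);
    }
}
```

`SimplePatternDemo.Strip("SELECT a FROM t -- trailing comment")` returns `"SELECT a FROM t "`, which is the intended result. `SimplePatternDemo.Strip("SELECT '--not a comment--' FROM t")` returns `"SELECT '"`, because the pattern has no notion of string literals. That is the edge case the other answers try to handle.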

Answer 2 (score: 2)

This seems to work well for me so far; it even ignores comments within strings, such as SELECT '--not a comment--' FROM ATable

    private static string removeComments(string sql)
    {
        string pattern = @"(?<=^ ([^'""] |['][^']*['] |[""][^""]*[""])*) (--.*$|/\*(.|\n)*?\*/)";
        return Regex.Replace(sql, pattern, "", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
    }

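Note: it is designed to remove /**/-style comments as well as -- style ones. Remove |/\*(.|\n)*?\*/ to drop the /**/ check. Also make sure you are using the RegexOptions.IgnorePatternWhitespace regex option!!

I wanted to be able to handle double quotes as well, but since T-SQL does not use them for string literals (with QUOTED_IDENTIFIER ON they delimit identifiers), you could also get rid of |[""][^""]*[""]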
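With both of those branches removed, the pattern might look like the sketch below. This is my reading of the two notes above rather than code from the answer, and `ReducedPatternDemo` is a made-up name. It relies on .NET allowing variable-length lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReducedPatternDemo
{
    // Assumed reduction of the answer's pattern: no "..." branch, no /* */ branch.
    // The lookbehind requires that everything from the start of the line up to the
    // "--" consists only of non-quote characters and complete '...' literals.
    private const string Pattern = @"(?<=^ ([^'] | '[^']*')*) --.*$";

    public static string Strip(string sql)
    {
        return Regex.Replace(sql, Pattern, "",
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
    }
}
```

For example, `ReducedPatternDemo.Strip("SELECT '--not a comment--' FROM ATable -- real comment")` should leave the literal alone and remove only the trailing comment. Because the lookbehind restarts at every line start, a string literal that spans several lines can still fool it.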

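Adapted from here

Note (March 2015): In the end, I used Antlr, a parser generator, for this project. There may have been some edge cases where the regex didn't work. I ended up much more confident in the results with Antlr, and it has worked well.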

Answer 3 (score: 1)

using System.Text.RegularExpressions;

public static string RemoveSQLCommentCallback(Match SQLLineMatch)
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    bool open = false; //opening of SQL String found
    char prev_ch = ' ';

    foreach (char ch in SQLLineMatch.ToString())
    {
        if (ch == '\'')
        {
            open = !open;
        }
        else if ((!open && prev_ch == '-' && ch == '-'))
        {
            break;
        }
        sb.Append(ch);
        prev_ch = ch;
    }

    return sb.ToString().TrimEnd('-'); // drop the first dash of the "--", which was appended before the break
}

Code

public static void Main()
{
    string sqlText = "WHERE DEPT_NAME LIKE '--Test--' AND START_DATE < SYSDATE -- Don't go over today";
    //for every matching line call callback func
    string result = Regex.Replace(sqlText, ".*--.*", RemoveSQLCommentCallback);
}

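Let Replace do the work: find all lines that contain a double dash and call your parsing function for each match.

Answer 4 (score: 0)

I don't know whether C#/VB.NET regular expressions are special in some way, but traditionally s/--.*// should work.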
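In .NET that substitution maps roughly onto `Regex.Replace`. One difference worth knowing: without `RegexOptions.Singleline`, `.` matches every character except `\n`, so on Windows line endings `--.*` also consumes the `\r`. A small sketch (the class and method names are only illustrative):

```csharp
using System;
using System.Text.RegularExpressions;

public static class SedStyleDemo
{
    // Rough equivalent of s/--.*// applied to every line.
    public static string Naive(string sql)
    {
        return Regex.Replace(sql, "--.*", "");
    }

    // Variant that stops before \r as well, so "\r\n" line endings stay intact.
    public static string KeepCrLf(string sql)
    {
        return Regex.Replace(sql, @"--[^\r\n]*", "");
    }
}
```

With the input `"a --x\r\nb"`, `Naive` returns `"a \nb"` while `KeepCrLf` returns `"a \r\nb"`. Neither variant knows about string literals.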

Answer 5 (score: 0)

In PHP, I use this code to strip comments from SQL (single-line comments only):

$sqlComments = '@(([\'"`]).*?[^\\\]\2)|((?:\#|--).*?$)\s*|(?<=;)\s+@ms';
/* Commented version
$sqlComments = '@
    (([\'"`]).*?[^\\\]\2) # $1 : Skip single & double quoted + backticked expressions
    |((?:\#|--).*?$)      # $3 : Match single line comments
    \s*                   # Trim after comments
    |(?<=;)\s+            # Trim after semi-colon
    @msx';
*/
$uncommentedSQL = trim( preg_replace( $sqlComments, '$1', $sql ) );
preg_match_all( $sqlComments, $sql, $comments );
$extractedComments = array_filter( $comments[ 3 ] );
var_dump( $uncommentedSQL, $extractedComments );

To remove all comments, see Regex to match MySQL comments

Answer 6 (score: 0)

As an up-to-date solution, the easiest way is to use ScriptDom-TSqlParser:

using System.Linq; // needed for the Where/Select calls below

// https://michaeljswart.com/2014/04/removing-comments-from-sql/
// http://web.archive.org/web/*/https://michaeljswart.com/2014/04/removing-comments-from-sql/
public static string StripCommentsFromSQL(string SQL)
{
    Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser parser = 
        new Microsoft.SqlServer.TransactSql.ScriptDom.TSql150Parser(true);

    System.Collections.Generic.IList<Microsoft.SqlServer.TransactSql.ScriptDom.ParseError> errors;


    Microsoft.SqlServer.TransactSql.ScriptDom.TSqlFragment fragments = 
        parser.Parse(new System.IO.StringReader(SQL), out errors);

    // clear comments
    string result = string.Join(
      string.Empty,
      fragments.ScriptTokenStream
          .Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.MultilineComment)
          .Where(x => x.TokenType != Microsoft.SqlServer.TransactSql.ScriptDom.TSqlTokenType.SingleLineComment)
          .Select(x => x.Text));

    return result;

}

Or, instead of using the Microsoft parser, you can use the ANTLR4 TSqlLexer,

or no parser at all:

private static System.Text.RegularExpressions.Regex everythingExceptNewLines = 
    new System.Text.RegularExpressions.Regex("[^\r\n]");


// http://drizin.io/Removing-comments-from-SQL-scripts/
// http://web.archive.org/web/*/http://drizin.io/Removing-comments-from-SQL-scripts/
public static string RemoveComments(string input, bool preservePositions, bool removeLiterals = false)
{
    //based on http://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689
    var lineComments = @"--(.*?)\r?\n";
    var lineCommentsOnLastLine = @"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
                                              // literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
                                              // there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
    var literals = @"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
    var bracketedIdentifiers = @"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
    var quotedIdentifiers = @"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
                                                          //var blockComments = @"/\*(.*?)\*/";  //the original code was for C#, but Microsoft SQL allows a nested block comments // //https://msdn.microsoft.com/en-us/library/ms178623.aspx

    //so we should use balancing groups // http://weblogs.asp.net/whaggard/377025
    var nestedBlockComments = @"/\*
                         (?>
                         /\*  (?<LEVEL>)      # On opening push level
                         | 
                         \*/ (?<-LEVEL>)     # On closing pop level
                         |
                         (?! /\* | \*/ ) . # Match any char unless the opening and closing strings   
                         )+                         # /* or */ in the lookahead string
                         (?(LEVEL)(?!))             # If level exists then fail
                         \*/";

    string noComments = System.Text.RegularExpressions.Regex.Replace(input,
        nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
        me => {
            if (me.Value.StartsWith("/*") && preservePositions)
                return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
            else if (me.Value.StartsWith("/*") && !preservePositions)
                return "";
            else if (me.Value.StartsWith("--") && preservePositions)
                return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
            else if (me.Value.StartsWith("--") && !preservePositions)
                return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
            else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
                return me.Value; // do not remove object identifiers ever
            else if (!removeLiterals) // Keep the literal strings
                return me.Value;
            else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
            {
                var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
                return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
            }
            else if (removeLiterals && !preservePositions) // wrap completely all literals
                return "''";
            else
                throw new System.NotImplementedException();
        },
        System.Text.RegularExpressions.RegexOptions.Singleline | System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace);
    return noComments;
}