Regex - matching IRC-like parameters?

Time: 2013-02-06 10:44:12

Tags: javascript regex string-parsing

I want to create an IRC-like command format:

/commandname parameter1 "parameter 2" "parameter \"3\"" parameter"4 parameter\"5

which (ideally) would give me a list of parameters:

parameter1
parameter 2
parameter "3"
parameter"4
parameter\"5

From what I have read, this is not trivial with a regex, and it may be better done with some other approach.

Thoughts?

Here is C# code that does what I need:

public List<string> ParseIrcCommand(string command)
    {
        command = command.Trim();
        command = command.TrimStart(new char[] { '/' });
        command += ' ';

        List<string> Tokens = new List<string>();

        int tokenStart = 0;
        bool inQuotes = false;
        bool inToken = true;
        string currentToken = "";
        for (int i = tokenStart; i < command.Length; i++)
        {
            char currentChar = command[i];
            char nextChar = (i + 1 >= command.Length ? ' ' : command[i + 1]);

            if (!inQuotes && inToken && currentChar == ' ')
            {
                Tokens.Add(currentToken);
                currentToken = "";
                inToken = false;
                continue;
            }

            if (inQuotes && inToken && currentChar == '"')
            {
                Tokens.Add(currentToken);
                currentToken = "";
                inQuotes = false;
                inToken = false;
                if (nextChar == ' ') i++;
                continue;
            }

            if (inQuotes && inToken && currentChar == '\\' && nextChar == '"')
            {
                i++;
                currentToken += nextChar;
                continue;
            }

            if (!inToken && currentChar != ' ')
            {
                inToken = true;
                tokenStart = i;
                if (currentChar == '"')
                {
                    tokenStart++;
                    inQuotes = true;
                    continue;
                }
            }

            currentToken += currentChar;
        }

        return Tokens;
    }

2 answers:

Answer 0: (score: 4)

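You have shown your code, which is good, but it seems you have not thought through whether it is reasonable to parse the command this way: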

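  • First, your code allows newline characters in the command name and in the parameters. That is reasonable only if you assume a newline character can never appear in the input.
  • Second, \ needs to be escaped just like ", because otherwise there is no way to specify a single \ at the end of a parameter without causing confusion.
  • Third, it is a bit odd that the command name is parsed the same way as the parameters. Command names are usually predetermined and fixed, so there is no need to allow flexible ways of specifying them.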
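The second point can be made concrete with a small sketch. The two patterns below are simplified stand-ins for a single quoted token, not code from the question: quoteOnly treats only \" as an escape (roughly the rule in the C# code), while quoteAndBackslash treats both \\ and \" as escapes. Both names are hypothetical.

```javascript
// Hypothetical patterns, for illustration only.
// Rule A: only \" is an escape sequence.
const quoteOnly = /^"((?:\\"|[^"])*)"/;
// Rule B: both \\ and \" are escape sequences.
const quoteAndBackslash = /^"((?:[^\\"]|\\[\\"])*)"/;

// Intended meaning in both cases: a first parameter that ends in a
// single backslash (C:\), followed by a second parameter (next).
const inputA = String.raw`"C:\" "next"`;   // rule A has no way to escape \
const inputB = String.raw`"C:\\" "next"`;  // rule B writes the backslash as \\

// Extract the first token and convert escape sequences back.
const tokenA = inputA.match(quoteOnly)[1].replace(/\\"/g, '"');
const tokenB = inputB.match(quoteAndBackslash)[1].replace(/\\([\\"])/g, '$1');

console.log(JSON.stringify(tokenA));
console.log(JSON.stringify(tokenB));
```

Under rule A the backslash escapes what was meant to be the closing quote, so the first token runs on to the next quote and comes out as C:" followed by a space. Under rule B the first token is C:\ as intended.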

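I cannot think of a general one-liner solution in JavaScript. JavaScript regex lacks \G, which asserts the position where the last match ended. So my solution has to make do with the beginning-of-string assertion ^ and chomp off the string as each token is matched.

(There is not much code here; it is mostly comments.)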

function parseCommand(str) {
    /*
     * Trim() in C# will trim off all whitespace characters
     * \s in JavaScript regex also matches any whitespace character
     * However, the set of characters considered as whitespace might not be
     * equivalent
     * But you can be sure that \r, \n, \t, space (ASCII 32) are included.
     * 
     * However, allowing all those whitespace characters in the command
     * is questionable.
     */
    str = str.replace(/^\s*\//, "");

    /* Look-ahead (?!") is needed to prevent matching of quoted parameter with
     * missing closing quote
     * The look-ahead comes from the fact that your code does not backtrack
     * while the regex engine will backtrack. Possessive quantifier can prevent
     * backtracking, but it is not supported by JavaScript RegExp.
     *
     * We emulate the effect of \G by using ^ and repeatedly chomping off
     * the string.
     *
     * The regex will match 2 cases:
     * (?!")([^ ]+)
     * This will match non-quoted tokens, which are not allowed to 
     * contain spaces
     * The token is captured into capturing group 1
     *
     * "((?:[^\\"]|\\[\\"])*)"
     * This will match quoted tokens, which consists of 0 or more:
     * non-quote-or-backslash [^\\"] OR escaped quote \"
     * OR escaped backslash \\
     * The text inside the quote is captured into capturing group 2
     */
    var regex = /^ *(?:(?!")([^ ]+)|"((?:[^\\"]|\\[\\"])*)")/;
    var tokens = [];
    var arr;

    while ((arr = str.match(regex)) !== null) {
        if (arr[1] !== void 0) {
            // Non-space token
            tokens.push(arr[1]);
        } else {
            // Quoted token, needs extra processing to
            // convert escaped character back
            tokens.push(arr[2].replace(/\\([\\"])/g, '$1'));
        }

        // Remove the matched text
        str = str.substring(arr[0].length);
    }

    // Test that the leftover consists of only space characters
    if (/^ *$/.test(str)) {
        return tokens;
    } else {
        // The only way to reach here is opened quoted token
        // Your code returns the tokens successfully parsed
        // but I think it is better to show an error here.
        return null;
    }
}
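As a side note, engines that support ES2015 offer the sticky flag y, which requires each match to start exactly at lastIndex and so plays a role similar to \G. The following is only a sketch of how the same token grammar might be driven that way, without chomping the string; parseCommandSticky is a hypothetical name and is not part of the answer above.

```javascript
// Sketch: same token grammar as parseCommand, but using the sticky flag "y"
// (ES2015+) so that every match must begin exactly at tokenRegex.lastIndex.
// parseCommandSticky is a hypothetical name used for illustration.
function parseCommandSticky(str) {
    // Skip leading whitespace and the slash, if present.
    const prefix = /^\s*\//.exec(str);
    const tokenRegex = / *(?:(?!")([^ ]+)|"((?:[^\\"]|\\[\\"])*)")/y;
    tokenRegex.lastIndex = prefix ? prefix[0].length : 0;

    const tokens = [];
    let end = tokenRegex.lastIndex; // end of the last successful match
    let m;
    while ((m = tokenRegex.exec(str)) !== null) {
        tokens.push(m[1] !== undefined
            ? m[1]                               // unquoted token
            : m[2].replace(/\\([\\"])/g, '$1')); // quoted token, unescaped
        end = tokenRegex.lastIndex;
    }

    // exec() resets lastIndex to 0 when it fails, hence the separate "end".
    // Anything other than trailing spaces means an unterminated quote.
    return /^ *$/.test(str.slice(end)) ? tokens : null;
}

const sample = String.raw`/commandname parameter1 "parameter 2" "parameter \"3\"" parameter"4 parameter\"5`;
console.log(parseCommandSticky(sample));
console.log(parseCommandSticky('/cmd "unterminated')); // null
```

Like parseCommand, this sketch returns the command name as the first token and returns null when a quoted token is never closed.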

Answer 1: (score: 0)

I created a simple regex that matches the command line you wrote.

/\w+\s((("([^\\"]*\\")*[^\\"]*")|[^ ]+)(\b|\s+))+$
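  • /\w+\s : finds the first part of the command
  • ((( : opens the three nested groups
  • "([^\\"]*\\")* : finds any string that starts with " and is followed, zero or more times, by text containing neither \ nor " and ending in \" (thus allowing "something\", "some\"thing\" and so on)
  • [^\\"]*" : followed by a run of characters containing neither \ nor ", and finally a "
  • )|[^ ]+ : this is the alternative: find any sequence of non-space characters
  • ) : closes the alternation
  • (\b|\s+) : all of it followed by whitespace or a word boundary
  • )+$ : one or more times, once per parameter, up to the end of the string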

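I am afraid this may fail in some cases, but I am posting it to show that parameters sometimes have a structure based on repetition. For example, "something\"something\"something\"end" has the repeating structure "([^\\"]*\\")*[^\\"]*", and you can use this idea to build your own regex.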
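A quick way to try this pattern against the command line from the question is sketched below. It assumes that the leading / in the pattern is the literal slash of the command, so it is escaped inside a JavaScript regex literal; the name commandLine is made up for the example.

```javascript
// Assumption: the leading "/" of the answer's pattern is the literal slash
// that starts the command, so it is written as \/ inside the regex literal.
const commandLine = /\/\w+\s((("([^\\"]*\\")*[^\\"]*")|[^ ]+)(\b|\s+))+$/;

const sample = String.raw`/commandname parameter1 "parameter 2" "parameter \"3\"" parameter"4 parameter\"5`;

console.log(commandLine.test(sample));         // true
console.log(commandLine.test('/commandname')); // false: no whitespace after the name
```

Note that this only checks whether the whole line has the expected shape. A capturing group inside a repeated group keeps only its last iteration, so the individual parameters still have to be extracted in a separate step.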