Question

I am using the following regular expression:

(public|private +)?function +([a-zA-Z_$][0-9a-zA-Z_$]*) *\\(([0-9a-zA-Z_$, ]*)\\) *{(.*)}

to match the following string:

public function messenger(text){
sendMsg(text);
}
private function sendMsg(text){
alert(text);
}

(There are no line breaks in the string; they are converted to spaces before the regex runs.)

I wanted it to capture both functions, but instead it captures:

$1: ""
$2: "messenger"
$3: "text"
$4: "sendMsg(text); } private function sendMsg(text){ alert(text);"

I am using JavaScript, by the way.

Answer 1

By default, the * operator is greedy and consumes as many characters as possible. Try the non-greedy equivalent, *?.

/((?:(?:public|private)\s+)?)function\s+([a-zA-Z_$][\w$]*)\s*\(([\w$, ]*)\)\s*{(.*?)}/

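\w matches word characters and is equivalent to [a-zA-Z0-9_]; it can also be used inside a character class. Note that this pattern does not correctly handle function bodies that contain nested blocks, because the non-greedy match stops at the first }. For example: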

function foo() {
    for (p in this) {
      ...
    }
}

Regular expressions cannot handle this unless the engine supports recursion (JavaScript's does not), which is why you need a proper parser.
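To illustrate what a parser does differently, here is a minimal sketch that finds the closing brace by counting nesting depth instead of using a regex. The helper name findMatchingBrace is hypothetical and not part of the answer, and the sketch assumes that no braces appear inside string literals or comments.

```javascript
// Hypothetical helper (not from the answer): given the index of an opening
// "{", return the index of the "}" that closes it, or -1 if it is unbalanced.
// Assumption: braces inside string literals or comments are NOT handled.
function findMatchingBrace(text, openIndex) {
  let depth = 0;
  for (let i = openIndex; i < text.length; i++) {
    if (text[i] === "{") {
      depth++;
    } else if (text[i] === "}") {
      depth--;
      if (depth === 0) return i; // back at the starting level: this is the match
    }
  }
  return -1;
}

const code = "function foo() { for (p in this) { work(p); } }";
const open = code.indexOf("{");
const close = findMatchingBrace(code, open);
const body = code.slice(open + 1, close);
console.log(body); // " for (p in this) { work(p); } "
```

A regex can still be used to find the function header; only the body needs the depth counter.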

Answer 2

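Since you accepted my (wrong) answer in another thread, I feel somewhat obliged to post a proper solution. It is not going to be quick or short, but hopefully it helps.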

Here is how I would write a regexp-based parser for a C-like language.

<script>
/* 
Let's start with this simple utility function. It's a
kind of stubborn version of String.replace() - it
checks the string over and over again, until nothing
more can be replaced
*/

function replaceAll(str, regexp, repl) {
    str = str.toString();
    while(str.match(regexp))
        str = str.replace(regexp, repl);
    return str;
}

/*
Next, we need a function that removes specific
constructs from the text and replaces them with
special "markers", which are "invisible" for further
processing. The matches are collected in a buffer so
that they can be restored later.
*/

function isolate(type, str, regexp, buf) {
    return replaceAll(str, regexp, function($0) {
        buf.push($0);
        return "<<" + type + (buf.length - 1) + ">>";
    });
} 

/*
The following restores "isolated" strings from the
buffer:
*/

function restore(str, buf) {
    return replaceAll(str, /<<[a-z]+(\d+)>>/g, function($0, $1) {
        return buf[parseInt($1, 10)];
    });
}

/*
Write down the grammar. Javascript regexps are
notoriously hard to read (there is no "comment"
option like in perl), therefore let's use more
readable format with spacing and substitution
variables. Note that "$string" and "$block" rules are
actually "isolate()" markers.
*/

var grammar = {
    $nothing: "",
    $space:  "\\s",
    $access: "public $space+ | private $space+ | $nothing",
    $ident:  "[a-z_]\\w*",
    $args:   "[^()]*",
    $string: "<<string [0-9]+>>",
    $block:  "<<block [0-9]+>>",
    $fun:    "($access) function $space* ($ident) $space* \\( ($args) \\) $space* ($block)"
}

/*
This compiles the grammar to pure regexps - one for
each grammar rule:
*/

function compile(grammar) {
    var re = {};
    for(var p in grammar)
        re[p] = new RegExp(
            replaceAll(grammar[p], /\$\w+/g, 
                    function($0) { return grammar[$0] }).
            replace(/\s+/g, ""), 
        "gi");
    return re;
}

/*
Let's put everything together
*/

function findFunctions(code, callback) {
    var buf = [];

    // isolate strings
    code = isolate("string", code, /"(\\.|[^\"])*"/g, buf);

    // isolate blocks in curly brackets {...}
    code = isolate("block",  code, /{[^{}]*}/g, buf);

    // compile our grammar
    var re = compile(grammar);

    // and perform an action for each function we can find
    code.replace(re.$fun, function() {
        var p = [];
        // skip the full match (index 0) and the trailing offset/input arguments
        for(var i = 1; i < arguments.length - 2; i++)
            p.push(restore(arguments[i], buf));
        return callback.apply(this, p);
    });
}
</script>

Now we are ready to test. Our parser must be able to handle escaped strings and arbitrarily nested blocks.

<code>
public function blah(arg1, arg2) {
    if("some string" == "public function") {
        callAnother("{hello}")
        while(something) {
            alert("escaped \" string");
        }
    }
}

function yetAnother() { alert("blah") }
</code>

<script>
window.onload = function() {
    var code = document.getElementsByTagName("code")[0].innerHTML;
    findFunctions(code, function(access, name, args, body) {
        document.write(
            "<br>" + 
            "<br> access= " + access +
            "<br> name= "   + name +
            "<br> args= "   + args +
            "<br> body= "   + body
        )
    });
}
</script>
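The test above needs a browser. As a rough sketch of the isolate/restore idea outside the DOM, the following condensed version can be run in Node. The helpers are rewritten here in shortened form, the sample input is made up for illustration, and the string pattern is slightly tightened to exclude backslashes from the character class; treat it as an approximation of the answer's code, not a replacement for it.

```javascript
// Keep replacing until the pattern no longer matches.
function replaceAll(str, regexp, repl) {
  str = String(str);
  while (str.match(regexp)) str = str.replace(regexp, repl);
  return str;
}

// Swap each match for a marker such as <<block3>> and remember the original.
function isolate(type, str, regexp, buf) {
  return replaceAll(str, regexp, (match) => {
    buf.push(match);
    return "<<" + type + (buf.length - 1) + ">>";
  });
}

// Expand markers back into the text they replaced, including nested ones.
function restore(str, buf) {
  return replaceAll(str, /<<[a-z]+(\d+)>>/g, (match, index) => buf[parseInt(index, 10)]);
}

const original = 'function outer() { if (a) { say("{hi}"); } }';
const buf = [];

// Strings go first, so that braces inside them cannot confuse the block step.
const afterStrings = isolate("string", original, /"(\\.|[^"\\])*"/g, buf);
// Innermost blocks are replaced first; the loop then works outwards.
const afterBlocks = isolate("block", afterStrings, /\{[^{}]*\}/g, buf);
const restored = restore(afterBlocks, buf);

console.log(afterStrings); // function outer() { if (a) { say(<<string0>>); } }
console.log(afterBlocks);  // function outer() <<block2>>
console.log(restored === original); // true
```

Once every body has collapsed into a single marker, a flat pattern such as the $fun rule can match the whole function without having to deal with nesting.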

Answer 3

Try changing

(.*)

to

(.*?)
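As a sketch of that change applied to the whole pattern from the question, assuming two further adjustments that are not part of this answer: the g flag is added so that every function is found, and the access group is rewritten as (?:(public|private) +)? so that "public" followed by a space is matched as well.

```javascript
// Input from the question, with newlines already converted to spaces.
const source =
  "public function messenger(text){ sendMsg(text); } " +
  "private function sendMsg(text){ alert(text); }";

// Question's pattern with (.*) changed to (.*?).
// Assumptions (not from the answer): the g flag, the regrouped access
// modifier, and escaped braces, which behave the same as unescaped ones here.
const pattern =
  /(?:(public|private) +)?function +([a-zA-Z_$][0-9a-zA-Z_$]*) *\(([0-9a-zA-Z_$, ]*)\) *\{(.*?)\}/g;

const found = [...source.matchAll(pattern)].map((m) => ({
  access: m[1],
  name: m[2],
  args: m[3],
  body: m[4].trim(),
}));

console.log(found.map((f) => f.access + " " + f.name + "(" + f.args + ")"));
// [ 'public messenger(text)', 'private sendMsg(text)' ]
```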

Answer 4

Change the last part of your regular expression:

{(.*)}

to this:

{(.*?)}

This makes it "non-greedy", so it does not capture all the way to the last } in the input.

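Note that this will break if any function body contains a } character, but at that point you are dealing with nesting, which regular expressions never handle well.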
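A small sketch of the difference, using only the brace part of the pattern. The sample strings are made up for illustration, and the braces are escaped here, which is equivalent to the unescaped form used above.

```javascript
const source =
  "public function messenger(text){ sendMsg(text); } " +
  "private function sendMsg(text){ alert(text); }";

const greedy = /\{(.*)\}/; // .* runs on to the LAST "}" in the input
const lazy = /\{(.*?)\}/;  // .*? stops at the FIRST "}" it reaches

const greedyBody = source.match(greedy)[1];
const lazyBody = source.match(lazy)[1];

console.log(greedyBody);
// " sendMsg(text); } private function sendMsg(text){ alert(text); "
console.log(lazyBody);
// " sendMsg(text); "

// The caveat: a body that contains its own "}" is cut short.
const nested = "function foo(){ if (x) { y(); } z(); }";
const nestedBody = nested.match(lazy)[1];
console.log(nestedBody);
// " if (x) { y(); "
```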

Regex is capturing the whole string
