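Regex to match a specific function and its arguments in a file

Time: 2014-11-19 14:46:38

Tags: regex parsing pcre

I'm working on a gettext JavaScript parser and I'm stuck on the parsing regex.

I need to catch every argument passed to a specific method call, _n( or _(. For example, if I have these in my JavaScript file: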

_("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls.. 

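This refers to this documentation: http://poedit.net/trac/wiki/Doc/Keywords

I'm planning on doing it in two passes (and two regexes):

  1. catch all function arguments for _n( or _( method calls
  2. catch only the stringy ones

Basically, I'd like a regex that says "catch everything after _n( or _( and stop at the last parenthesis ), where the function call ends". I don't know whether that's possible with a regex and without a JavaScript parser.

What could also be done is "catch every "string" or 'string' after _n( or _( and stop at the end of the line or at the beginning of a new _n( or _(".

In everything I've tried, I get stuck either on _( "one (optional)" ); with its inner parentheses, or on apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) with two calls on the same line.

This is what I've implemented so far, with imperfect regexes: a generic parser, a JavaScript one, and a Handlebars one.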

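6 Answers:

Answer 0: (score: 8)

Note: Read this answer if you're not familiar with recursion.

Part 1: match specific functions

Who said that regex can't be modular? Well, PCRE to the rescue!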

~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
_n?                               # Match _ or _n
\s*                               # Optional white spaces
(?P<results>(?&brackets))         # Recurse/use the brackets pattern and put it in the results group
~sx

The s modifier makes . match newlines too, and the x modifier allows this fancy spacing and commenting inside our regex.
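The building block doing the heavy lifting is the (?&name) subroutine call: a group defined once inside (?(DEFINE)...) can be reused anywhere, including inside itself. As a minimal sketch of that idea, here is a condensed variant of the pattern run through preg_match_all(). The short group names (dq, sq) and the tightened character classes are shorthand for this illustration, not the exact regex above.

```php
<?php
// Condensed variant of the Part 1 pattern (illustrative shorthand, not the exact regex above).
$pattern = <<<'REGEX'
~
(?(DEFINE)
   (?P<dq> " (?: [^"\\] | \\. )* " )     # Double-quoted string with escapes
   (?P<sq> ' (?: [^'\\] | \\. )* ' )     # Single-quoted string with escapes
   (?P<brackets>                         # Balanced brackets, calling itself for nested pairs
      \( (?: (?&dq) | (?&sq) | [^()"'] | (?&brackets) )* \)
   )
)
_n? \s* (?P<results> (?&brackets) )
~sx
REGEX;

$input = <<<'JS'
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)
_( "one (optional)" );
JS;

preg_match_all($pattern, $input, $m);
print_r($m['results']); // One entry per call, brackets included
```

Both calls on the first line are found separately, and the bracket inside "one (optional)" is swallowed by the string group instead of closing the call.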

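Part 2: getting rid of opening & closing brackets

Since our regex also captures the opening and closing brackets (), we might want to filter them out. We'll use preg_replace() on the results: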

~           # Delimiter
^           # Assert begin of string
\(          # Match an opening bracket
\s*         # Match optional whitespaces
|           # Or
\s*         # Match optional whitespaces
\)          # Match a closing bracket
$           # Assert end of string
~x
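As a quick sketch of what this step does, preg_replace() accepts an array as subject and returns an array, so the whole list of captured calls can be cleaned in one go. The sample strings in $raw are made up for illustration, and the regex is the one above written on a single line.

```php
<?php
// Sample captures as Part 1 would return them (illustrative values).
$raw = array('( "No apples" )', '("bar", "baz", 42)', '()');

// Strip a leading "(" plus whitespace, and trailing whitespace plus ")".
$filtered = preg_replace('~^\(\s*|\s*\)$~', '', $raw);

print_r($filtered);
```

Only the outermost brackets go away, because both alternatives are anchored to the start or the end of each string.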

Part 3: extracting arguments

So here's another modular regex; you could even add your own grammar:

~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<array>
      Array\s*
      (?&brackets)
   )

   (?P<variable>
      [^\s,()]+        # I don't know the exact grammar for a variable in ECMAScript
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            (?&array)             # Recurse/use the array pattern
            |                     # Or
            (?&variable)          # Recurse/use the variable pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)                      # Last, so quoted strings are tried before the catch-all variable
~xis

We'll loop over the results and use preg_match_all(). The final code looks like this:

$functionPattern = <<<'regex'
~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
_n?                               # Match _ or _n
\s*                               # Optional white spaces
(?P<results>(?&brackets))         # Recurse/use the brackets pattern and put it in the results group
~sx
regex;


$argumentsPattern = <<<'regex'
~                      # Delimiter
(?(DEFINE)             # Start of definitions
   (?P<str_double_quotes>
      (?<!\\)          # Not escaped
      "                # Match a double quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      "                # Match the ending double quote
   )

   (?P<str_single_quotes>
      (?<!\\)          # Not escaped
      '                # Match a single quote
      (?:              # Non-capturing group
         [^\\]         # Match anything not a backslash
         |             # Or
         \\.           # Match a backslash and a single character (ie: an escaped character)
      )*?              # Repeat the non-capturing group zero or more times, ungreedy/lazy
      '                # Match the ending single quote
   )

   (?P<array>
      Array\s*
      (?&brackets)
   )

   (?P<variable>
      [^\s,()]+        # I don't know the exact grammar for a variable in ECMAScript
   )

   (?P<brackets>
      \(                          # Match an opening bracket
         (?:                      # A non capturing group
            (?&str_double_quotes) # Recurse/use the str_double_quotes pattern
            |                     # Or
            (?&str_single_quotes) # Recurse/use the str_single_quotes pattern
            |                     # Or
            (?&array)             # Recurse/use the array pattern
            |                     # Or
            (?&variable)          # Recurse/use the variable pattern
            |                     # Or
            [^()]                 # Anything not a bracket
            |                     # Or
            (?&brackets)          # Recurse the bracket pattern
         )*
      \)
   )
)                                 # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~six
regex;

$input = <<<'input'
_  ("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..

// misleading cases
_n("foo (")
_n("foo (\)", 'foo)', aa)
_n( Array(1, 2, 3), Array(")",   '(')   );
_n(function(foo){return foo*2;}); // Is this even valid?
_n   ();   // Empty
_ (   
    "Foo",
    'Bar',
    Array(
        "wow",
        "much",
        'whitespaces'
    ),
    multiline
); // PCRE is awesome
input;

if(preg_match_all($functionPattern, $input, $m)){
    $filtered = preg_replace(
        '~          # Delimiter
        ^           # Assert begin of string
        \(          # Match an opening bracket
        \s*         # Match optional whitespaces
        |           # Or
        \s*         # Match optional whitespaces
        \)          # Match a closing bracket
        $           # Assert end of string
        ~x', // Regex
        '', // Replace with nothing
        $m['results'] // Subject
    ); // Getting rid of opening & closing brackets

    // Part 3: extract arguments:
    $parsedTree = array();
    foreach($filtered as $arguments){   // Loop
        if(preg_match_all($argumentsPattern, $arguments, $m)){ // If there's a match
            $parsedTree[] = array(
                'all_arguments' => $arguments,
                'branches' => $m[0]
            ); // Add an array to our tree and fill it
        }else{
            $parsedTree[] = array(
                'all_arguments' => $arguments,
                'branches' => array()
            ); // Add an array with empty branches
        }
    }

    print_r($parsedTree); // Let's see the results;
}else{
    echo 'no matches';
}

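You might want to create a recursive function to generate a full tree. See this answer.

You might notice that the function(){} part isn't parsed correctly. I'll leave that as an exercise for the reader :)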
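One way such a recursive function could look, as a rough sketch: split the argument string with a condensed argument pattern, and whenever a branch is an Array(...), call the same function on the text between its outer brackets. The name buildTree, the 'array' key, and the condensed pattern are all hypothetical choices for this illustration, not part of the code above.

```php
<?php
// Condensed argument pattern (illustrative shorthand, not the exact regex from Part 3).
$argumentsPattern = <<<'REGEX'
~
(?(DEFINE)
   (?P<dq> " (?: [^"\\] | \\. )* " )
   (?P<sq> ' (?: [^'\\] | \\. )* ' )
   (?P<brackets> \( (?: (?&dq) | (?&sq) | [^()"'] | (?&brackets) )* \) )
   (?P<array> Array \s* (?&brackets) )
   (?P<variable> [^\s,()"']+ )
)
(?&array) | (?&dq) | (?&sq) | (?&variable)
~six
REGEX;

// Hypothetical helper: returns the arguments as a list, turning every
// Array(...) argument into array('array' => <its own parsed arguments>).
function buildTree($arguments, $pattern)
{
    $tree = array();
    if (!preg_match_all($pattern, $arguments, $m)) {
        return $tree; // No arguments (or a regex error): empty branch
    }
    foreach ($m[0] as $branch) {
        if (preg_match('~^Array\s*\((.*)\)$~is', $branch, $inner)) {
            // Recurse into the text between the outer brackets
            $tree[] = array('array' => buildTree($inner[1], $pattern));
        } else {
            $tree[] = $branch;
        }
    }
    return $tree;
}

$tree = buildTree('"Foo", Array("wow", Array(1, ")"), \'x\'), multiline', $argumentsPattern);
print_r($tree);
```

Strings and variables stay as plain leaves, while each nested Array(...) becomes its own sub-list, so the depth of the result follows the depth of the input.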

Answer 1: (score: 1)

Try this:

(?<=\().*?(?=\s*\)[^)]*$)

See the live demo.

Answer 2: (score: 0)

The following regex should help you.

^(?=\w+\()\w+?\(([\s'!\\\)",\w]+)+\);

Check the demo here.

Answer 3: (score: 0)

\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)

This should get anything between a pair of parentheses, ignoring parentheses inside quotes. Explanation:

\( // Literal open paren
    (
         | //Space or
        "(\\"|[^"])*"| //Anything between two double quotes, including escaped quotes, or
        '(\\'|[^'])*'| //Anything between two single quotes, including escaped quotes, or
        [^)"'] //Any character that isn't a quote or close paren
    )*? // All that, as many times as necessary
\) // Literal close paren
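To get a feel for the behaviour, here is a small PHP sketch that runs the regex with preg_match_all(). The third input, with a nested call, is a made-up case that is not in the question; it is there to show that the pattern skips brackets inside quotes but does not balance nested ones.

```php
<?php
// The regex from above, wrapped in ~ delimiters (nowdoc, so no extra escaping is needed).
$pattern = <<<'REGEX'
~\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)~
REGEX;

$inputs = array(
    '_( "one (optional)" );',
    '_n("bar", "baz", 42);',
    '_n(f(a), b);', // Hypothetical nested call, not from the question
);

$results = array();
foreach ($inputs as $input) {
    preg_match_all($pattern, $input, $m);
    $results[$input] = $m[0]; // Full matches, brackets included
}
print_r($results);
```

The first two inputs come back whole, while the nested one is cut at the first closing bracket, giving (f(a).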

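No matter how you slice it, regular expressions are going to cause problems here. They're hard to read, hard to maintain, and highly inefficient. I'm not familiar with gettext, but perhaps you could use a for loop?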

// A sketch in JavaScript. A loop like this can be more readable, maintainable, and predictable than a regular expression.
// `capture` is a callback that receives the text found between each pair of parentheses.
function scan(input, capture) {
    for (let i = 0; i < input.length; i++) {
        // Ignore anything that isn't an opening paren
        if (input[i] !== '(') continue;
        i++; // Step past the opening paren
        let capturedText = "";
        // Loop until a close paren or the end of the input is reached
        while (i < input.length && input[i] !== ')') {
            const quote = input[i];
            capturedText += input[i++];
            if (quote !== '"' && quote !== "'") continue;
            // Inside a string: copy until the unescaped closing quote or the end of the input
            while (i < input.length && input[i] !== quote) {
                if (input[i] === '\\') capturedText += input[i++]; // Keep the backslash, then copy the escaped character below
                if (i < input.length) capturedText += input[i++];
            }
            if (i < input.length) capturedText += input[i++]; // Closing quote
        }
        capture(capturedText);
    }
}

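Note: I did not cover how to determine whether a parenthesis belongs to a function call or is just a grouping symbol (i.e., this will also match a = (b * c)). That's complicated, as covered in detail here. As your code becomes more and more accurate, you get closer and closer to writing your own JavaScript parser. If you need that sort of accuracy, you might want to look at the source code of actual JavaScript parsers.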
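For the narrow case in the question, a rough approximation is to accept an opening paren only when it directly follows _ or _n, and when that name is not the tail of a longer identifier. The sketch below assumes exactly that rule; it knows nothing about comments or string contents, and the sample input is made up.

```php
<?php
// Assumption: a call site is "_" or "_n", not preceded by an identifier
// character (\w or $), followed by optional whitespace and "(".
$callStart = '~(?<![\w$])_n?\s*\(~';

$input = 'a = (b * c); my_(x); _n("bar", "baz", 42); _ ("foo")';

preg_match_all($callStart, $input, $m);
print_r($m[0]); // Only the starts of the real _n( and _( calls
```

The grouping paren in a = (b * c) and the call to my_( are both skipped, so a scanner could start capturing only from the offsets this regex reports.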

Answer 4: (score: 0)

A bit of code (you can test this PHP code at http://writecodeonline.com/php/ to check):

$string = '_("foo")
_n("bar", "baz", 42); 
_n(domain, "bux", var);
_( "one (optional)" );
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)';

preg_match_all('/(?<=(_\()|(_n\())[\w", ()%]+(?=\))/i', $string, $matches);

foreach($matches[0] as $test){
    $opArr = explode(',', $test);
    foreach($opArr as $test2){
        echo trim($test2) . "\n";
    }
}

You can see the initial pattern and how it works here: http://regex101.com/r/fR7eU2/1

The output is:

"foo"
"bar"
"baz"
42
domain
"bux"
var
"one (optional)"
"No apples"
"%1 apple"
"%1 apples"
apples

Answer 5: (score: -1)

We can do this in two steps:

1) Catch all function arguments for _n( or _( method calls

(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)

See demo.

http://regex101.com/r/oE6jJ1/13

2) Catch only the stringy ones

"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))

See demo.

http://regex101.com/r/oE6jJ1/14
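Chained together in PHP, the two steps could look roughly like this. It is only a sketch: the variable names are made up, and the way group 1 and group 2 are merged into one list is an assumption about how the results are meant to be read.

```php
<?php
// Step 1: the call regex from above. Step 2: the argument regex from above.
$callPattern = '~(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)~';
$argPattern = <<<'REGEX'
~"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))~
REGEX;

$input = 'apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)';

$calls = array();
preg_match_all($callPattern, $input, $found);
foreach ($found[0] as $call) {
    preg_match_all($argPattern, $call, $sets, PREG_SET_ORDER);
    $args = array();
    foreach ($sets as $set) {
        // Group 1 holds a quoted string (without its quotes), group 2 a bare argument
        $args[] = (isset($set[2]) && $set[2] !== '') ? trim($set[2]) : $set[1];
    }
    $calls[$call] = $args;
}
print_r($calls);
```

Note that step 2 returns the strings without their surrounding quotes, so "apples" the variable and "apples" the string would look the same in the merged list unless the two groups are kept apart.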