Question

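I'm looking at a string and trying to get everything inside each pair of parentheses. The contents may change, and max and min may not be present in some cases.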

get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)

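The where() and sort() clauses are not guaranteed to be present.
There may be spaces between each group, and [EDIT] the keywords may not always be the same.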

get(something) where(something)
get(something)where(something) sort(something)

What regular expression should be used? Effectively, it should return:

Array (
[0] => max(fieldname1),min(fieldname2),fieldname3
[1] => something=something
[2] => fieldname2 asc
)

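I realise that changing the first set of parentheses to { or [ would solve the problem, but I'm stubborn and want to do it this way with a regex.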

EDIT The best I could come up with, using preg_match_all(), is:

/[a-zA-Z0-9_]+\((.*?)\)/
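
A quick check of what that pattern captures on the sample string may help show where it falls short. This is only a sketch and the variable names are mine. Because .*? is lazy, each match ends at the first closing parenthesis it meets, so nested calls get cut.

```php
<?php
$str = 'get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)';

// The pattern from the question: a word, then "(", then as little as possible up to ")".
preg_match_all('/[a-zA-Z0-9_]+\((.*?)\)/', $str, $matches);

// $matches[1] holds the first capture group of every match.
print_r($matches[1]);
```

The first capture comes back truncated as max(fieldname1, and the inner min(...) call is matched as if it were a separate clause, which is the nesting problem the answers discuss.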

Answer 1

You're better off using a parser, such as:

$str = 'get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)';
$array = array();
$buffer = '';
$depth = 0;
for ($i=0; $i<strlen($str); $i++) {
    $buffer .= $str[$i];
    switch ($str[$i]) {
        case '(':
            $depth++;
            break;
        case ')':
            $depth--;
            if ($depth === 0) {
                $array[] = $buffer;
                $buffer = '';
            }
            break;
    }
}
var_dump($array);
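
The loop above keeps the keyword and the outer parentheses in each chunk. If only the text inside each top-level pair is wanted, as in the question's expected array, a variant along these lines might do. It is a sketch, not the answer's code: the function name bracket_contents and the error handling are my own additions.

```php
<?php
// Collect the text inside each top-level (...) group, keeping nested groups intact.
function bracket_contents(string $str): array {
    $parts = [];
    $buffer = '';
    $depth = 0;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $ch = $str[$i];
        if ($ch === '(') {
            if ($depth > 0) {
                $buffer .= $ch;       // nested opener belongs to the content
            }
            $depth++;
        } elseif ($ch === ')') {
            $depth--;
            if ($depth < 0) {
                throw new InvalidArgumentException("Unexpected ')' at offset $i");
            }
            if ($depth === 0) {
                $parts[] = $buffer;   // top-level group closed
                $buffer = '';
            } else {
                $buffer .= $ch;       // nested closer belongs to the content
            }
        } elseif ($depth > 0) {
            $buffer .= $ch;           // ordinary character inside a group
        }
        // characters at depth 0 (keywords, spaces) are skipped
    }
    if ($depth !== 0) {
        throw new InvalidArgumentException('Unbalanced parentheses');
    }
    return $parts;
}

$parts = bracket_contents('get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)');
print_r($parts);
```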

Answer 2

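Since you've clarified that those are optional, I don't believe this can be done with a regular expression. You could make it possible by keeping the different clauses (get, where, sort) in their own strings, but I don't think you can do it as is.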

EDIT again: It's conceptually somewhat similar to this question from yesterday, which was shown to be impossible to do with a regex: Regex for checking if a string has mismatched parentheses?

Answer 3

How about this?

^\s*get\((.*?)\)(?:\s*where\((.*?)\))?(?:\s*sort\((.*?)\)\s*)?$

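Now I'm not convinced this will work. For example, the first match (for get) could overflow into the where and sort clauses. You may be able to deal with this using lookaheads, for example: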

^\s*get\(((?:.(?!sort|where))*?)\)(?:\s*where\(((?:.(?!sort))*?)\))?(?:\s*sort\((.*?)\)\s*)?$

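But really this is a pretty gnarly regular expression, and Gumbo is correct that a parser is arguably the better way to go. That is true of any input with matched elements. HTML/XML is the classic case where regular expressions are often used despite being a poor fit. It's even worse in those cases because parsers are freely available and mature.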

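There are a lot of cases to handle with something like this:

Optionality of parts of the expression;
False signals from literals, e.g. get(") sort") will break the above;
Escaped characters;
Nesting.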
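
The false-signal case can be checked directly. The sketch below uses a variant of the lookahead pattern in which both where(...) and sort(...) are optional (that optionality is my assumption, as are the variable names). The ordinary query matches, while the literal containing ") sort" does not.

```php
<?php
// Lookahead pattern: a clause body may not contain a character that is
// immediately followed by one of the later keywords.
$pattern = '/^\s*get\(((?:.(?!sort|where))*?)\)'
         . '(?:\s*where\(((?:.(?!sort))*?)\))?'
         . '(?:\s*sort\((.*?)\)\s*)?$/';

$query = 'get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)';

$ok  = preg_match($pattern, $query, $m);       // ordinary query
$bad = preg_match($pattern, 'get(") sort")');  // keyword hidden inside a literal

var_dump($ok, $bad);
print_r(array_slice($m, 1));
```

The second call fails because the space before sort inside the quotes trips the negative lookahead, so the get(...) body can never reach the real closing parenthesis.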

Chad points out the matched-pair problem I was talking about, and it's worth restating. Suppose you have the following HTML:

<div>
  <div></div>
</div>

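Getting matched pairs of tags is impossible with regular expressions (but people keep trying, or simply don't account for this type of input). What makes your case possibly workable is that you have some known markers to rely on: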

The keywords get, where and sort; and
The start and end of the string.

But honestly, a regex is not the recommended approach.

So if you want something robust and reliable, write a parser. A regex for this kind of thing is nothing more than a quick and dirty solution.

Answer 4

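I agree with what's been said about regexes not being suitable for generic structures like this. However, provided the parentheses are balanced and nested no more than two levels deep, these regexes may help: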

(\w+\s*\([^()]*(?:(?:\([^()]*\))[^()]*)*\)\s*)

matches and captures a single xyz(....) instance, while

(\w+\s*\([^()]*(?:(?:\([^()]*\))[^()]*)*\)\s*)+

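matches all of them. Depending on your language, you may be able to use the second one and unpack the multiple captures of the single group. This reference may be helpful.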

But, to repeat, I don't think a regex is the right tool for this, which is why this rather restrictive solution is so clumsy.

Sorry, I just noticed you're using PHP. You will probably need to use this:

(\w+\s*\([^()]*(?:(?:\([^()]*\))[^()]*)*\)\s*)(.*)

to split your line into (one piece) plus (the rest), and loop until there is nothing left.
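
That loop might look something like the sketch below. The pattern is adapted from the one above so that the keyword and the body are captured separately; the function name split_clauses and the keyed result array are my own choices, not part of the answer.

```php
<?php
// Peel one keyword(body) piece off the front of the line at a time.
function split_clauses(string $line): array {
    // group 1: keyword, group 2: body with at most one nested (...) level, group 3: the rest
    $pattern = '/^\s*(\w+)\s*\(([^()]*(?:\([^()]*\)[^()]*)*)\)\s*(.*)$/s';
    $clauses = [];
    while ($line !== '' && preg_match($pattern, $line, $m)) {
        $clauses[$m[1]] = $m[2];
        $line = $m[3];   // continue with whatever is left
    }
    return $clauses;
}

$clauses = split_clauses('get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)');
print_r($clauses);
```

Keying the result by keyword means a missing where(...) or sort(...) simply leaves that key out. Anything that does not fit the pattern stops the loop, so a stricter version would want to report leftover input.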

Answer 5

This is a very hackish way of doing it and could probably be done better, but just as a proof of concept:

get\((max\(.+?\)),(min\(.+?\)),(.+?)\)(where\((.+?=.+?)\)| where\((.+?=.+?)\)|)(sort\((.+?)\)| sort\((.+?)\)|)

The position of the data in the match array will change depending on whether the information is found. You can test it out there!

Answer 6

I sat down for a while and wrote a fully fledged FSM parser, just out of interest.

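It has a few features you're unlikely to get from a regex alone (at least as far as I could tell under PHP; I could do it with recursive regexes in Perl, but I did not find that feature in PHP).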

Smart, stack-based bracket parsing
AnyBracket support
Modular
Extensible.
When the syntax is wrong, it can tell you where.
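
On the recursive-regex point: PCRE, the engine behind PHP's preg_* functions, does accept recursive subpattern calls such as (?2), so a regex-only sketch for balanced parentheses is possible in PHP as well. The pattern and variable names below are mine, and this handles only round parentheses with no quoting or escaping.

```php
<?php
$str = 'get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)';

// group 1: keyword; group 2: a balanced (...) group.
// (?2) re-enters group 2, which is what allows arbitrary nesting depth.
$pattern = '/(\w+)\s*(\((?:[^()]++|(?2))*\))/';

preg_match_all($pattern, $str, $matches, PREG_SET_ORDER);

$result = [];
foreach ($matches as $m) {
    // drop the outer "(" and ")" to keep only the body
    $result[$m[1]] = substr($m[2], 1, -1);
}
print_r($result);
```

This does not give the error reporting or the multiple bracket types listed above, so it is a complement to the parser rather than a replacement.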

There is quite a lot of code here, and much of it is a bit quirky and convoluted for new coders, but for what it is, it's pretty awesome stuff.

It's not a finished product, just something I threw together, but it works and has no bugs that I can find.

I've used 'die' in lots of places. Normally it would be better to use Exceptions and the like, so a cleanup and refactor would be wise before rolling it out.
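
As a sketch of what that cleanup could look like, the token helper might throw instead of dying. ParseException and get_token_or_throw are hypothetical names introduced here for illustration; the matching logic mirrors get_token in the code that follows.

```php
<?php
// Hypothetical exception type for parse failures.
class ParseException extends RuntimeException {}

// Like get_token(), but signals failure with an exception instead of die().
function get_token_or_throw(string $regex, string $input): array {
    if (!preg_match('/^' . $regex . '/', $input, $matches)) {
        throw new ParseException("Could not match $regex at start of '$input'");
    }
    return [
        'success' => true,
        'match'   => $matches[1],
        'rest'    => substr($input, strlen($matches[1])),
    ];
}

$message = null;
try {
    get_token_or_throw('(get\()', 'where(x)');
} catch (ParseException $e) {
    // The caller decides how to report or recover from the error.
    $message = $e->getMessage();
}
echo $message, "\n";
```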

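It has a reasonable amount of comments, but I feel that if I commented any further, the nitty-gritty of the finite state machinery would be harder to follow.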



# Pretty Colour Debug of the tokeniser in action. 
# Uncomment to use. 
function debug( $title, $stream, $msg, $remaining ){ 
#  print chr(27) ."[31m$title" . chr(27) ."[0m\n";
# print chr(27) ."[33min:$stream" . chr(27) ."[0m\n";
#  print chr(27) ."[32m$msg" . chr(27) ."[0m\n";
#  print chr(27) ."[34mstream:$remaining" . chr(27) ."[0m\n\n";
}

# Simple utility to store a captured part of the stream in one place
# and the remainder somewhere else
# Wraps most the regexy stuff 
# Inspired by some Perl Regex Parser I found. 

function get_token( $regex, $input ){ 
  $out = array( 
      'success' => false,
      'match' => '',
      'rest' => ''
  );
  if( !preg_match( '/^' . $regex . '/' , $input, $matches ) ){
    die("Could not match $regex at start of $input ");
    #return $out; # error condition, not matched. 
  }
  $out['match'] = $matches[1];
  $out['rest'] = substr( $input, strlen( $out['match'] ) );
  $out['success'] = true;
  debug( 'Scan For Token: '. $regex , $input, "matched: " . $out['match'] , $out['rest'] );
  return $out;
}


function skip_space( $input ){ 
  return get_token('(\s*)', $input ); 
}

# Given $input and $opener, find 
# the data stream that occurs until the respective closer. 
# All nested bracket sets must be well balanced. 
# No 'escape code' implementation has been done (yet) 
# Match will contain the contents, 
# Rest will contain unprocessed part of the string
# []{}() and <> bracket types are currently supported. 

function close_bracket( $input , $opener ){
  $out = array( 
      'success' => false,
      'match' => '',
      'rest' => ''
  );

  $map = array( '(' => ')', '[' => ']', '{' => '}', chr(60) => '>' );
  $nests = array( $map[$opener] ); 

  while( strlen($input) > 0 ){ 
    $d = get_token( '([^()\[\]{}' . chr(60). '>]*?[()\[\]{}' . chr(60)  . '>])', $input ); 
    $input = $d['rest']; 

    if( !$d['success'] ){  
      debug( 'Scan For ) Bailing ' , $input, "depth: " . count($nests) . ", matched: " . $out['match'] , $out['rest'] );

      $out['match'] .= $d['match'];
      return $out; # error condition, not matched. brackets are imbalanced. 
    }

# Work out which of the 4 bracket types we got, and
# Which orientation it is, and then decide if were going up the tree or down it

    end($nests);
    $tail = substr( $d['match'], -1, 1 );
    if( $tail == current($nests) ){ 
      array_pop( $nests );
    } elseif ( array_key_exists( $tail, $map ) ){ 
      array_push( $nests, $map[$tail] ); 
    } else {
      die ("Error. Bad bracket Matching, unclosed/unbalanced/unmatching bracket sequence: " . $out['match'] . $d['match'] );
    }
    $out['match'] .= $d['match'] ; 
    $out['rest' ]  = $d['rest'];
    debug( 'Scan For ) running' , $input, "depth: " . count($nests) . ", matched: " . $out['match'] , $out['rest'] );

    if ( count($nests) == 0 ){ 
      # Chomp off the tail bracket to just get the body
      $out['match'] = substr( $out['match'] , 0 , -1 );
      $out['success'] = true;
      debug( 'Scan For ) returning ' , $input, "matched: " . $out['match'] , $out['rest'] );
      return $out;
    }
  }
  die('Scan for closing ) exhausted buffer while searching. Brackets Mismatched. Fix this: \'' . $out['match'] . '\'');
}

# Given $function_name and $input, expects the form fnname(data) 
# 'data' can be any well balanced bracket sequence 
# also, brackets used for functions in the stream can be any of your choice, 
# as long as you're consistent. fnname[foo] will work. 

function parse_function_body( $input, $function_name ){ 
  $out = array ( 
    'success' => false, 
    'match' => '', 
    'rest' => '', 
  );

  debug( 'Parsing  ' . $function_name . "()", $input, "" , "" );

  $d = get_token( "(" . $function_name . '[({\[' . chr(60) . '])' , $input ); 

  if ( !$d['success'] ){ 
     die("Doom while parsing for function $function_name. Not Where its expected.");
  }

  $e = close_bracket( $d['rest'] , substr($d['match'],-1,1) );

  if ( !$e['success'] ){
    die("Found Imbalanced Brackets while parsing for $function_name, last snapshot was '" . $e['match'] . "'");
    return $out; # inbalanced brackets for function
  }
  $out['success'] = true;
  $out['match'] = $e['match']; 
  $out['rest'] = $e['rest'];
  debug( 'Finished Parsing  ' . $function_name . "()", $input, 'body:'. $out['match'] , $out['rest'] );

  return $out;
}

function  parse_query( $input ){ 

  $eat  = skip_space( $input ); 
  $get = parse_function_body( $eat['rest'] , 'get' ); 
  if ( !$get['success'] ){ 
    die("Get Token Malformed/Missing, instead found '" . $eat['rest'] . "'"); 
  }
  $eat = skip_space( $get['rest'] ); 
  $where = parse_function_body( $eat['rest'], 'where' ); 
  if ( !$where['success'] ){ 
    die("Where Token Malformed/Missing, instead found '" . $eat['rest'] . "'"); 
  }
  $eat = skip_space( $where['rest'] ); 
  $sort = parse_function_body( $eat['rest'], 'sort' ); 
  if( !$sort['success'] ){
    die("Sort Token Malformed/Missing, instead found '" . $eat['rest'] . "'"); 
  }
  return array( 
      'get' => $get['match'],
      'where' => $where['match'], 
      'sort' => $sort['match'], 
      '_Trailing_Data' =>  $sort['rest'],
  );
}



$structure = parse_query("get[max(fieldname1),min(fieldname2),fieldname3]where(something=something) sort(fieldname2 asc)");

print_r($structure);

$structure = parse_query("get(max(fieldname1),min(fieldname2),fieldname3)where(something=something) sort(fieldname2 asc)");

print_r($structure);

$structure = parse_query("get{max(fieldname1),min(fieldname2),fieldname3}where(something=something) sort(fieldname2 asc)");

print_r($structure);

$structure = parse_query("get" . chr(60) . "max(fieldname1),min(fieldname2),fieldname3" . chr(62) . "where(something=something) sort(fieldname2 asc)");

print_r($structure);

All of the print_r($structure) lines above should produce this:

Array
(
    [get] => max(fieldname1),min(fieldname2),fieldname3
    [where] => something=something
    [sort] => fieldname2 asc
    [_Trailing_Data] =>
)

Regular expression to get the text inside parentheses
