想象一下这样的字符串:
field1,field2(subfield1),field3(subfield2,subfield3),field4(),field5(subfield4(subsubfield,subsubfield2))
我想得到一个这样的数组:
array(
field1 => array(),
field2 => array(subfield1),
field3 => array(
subfield2,
subfield3
),
field4 => array(),
field5 => array(
subfield4 => array(
subsubfield => array(),
subsubfield => array()
)
)
)
我有这个正则表达式[a-zA-Z0-9]*\([^()]*(?:(?R)[^()]*)*\)
,它可以完成一些输出工作:
array(
field1,
field2(subfield1),
field3(subfield2,subfield3),
field4(),
field5(subfield4(subsubfield,subsubfield2))
)
虽然这不是我想要的。我现在有点困惑,但到目前为止我提出的选择是:
对于它的价值,无论如何我必须遍历字段和子字段。 我有一些代码使用给定的正则表达式,并在以后需要时对其值运行相同的匹配。我想一次解析整个字符串,包括它的嵌套子字符串。
有谁知道我是如何开始这个的?哪种选择是最好(或更好)的方法? (可读性与资源使用率和复杂性对等)
答案 0 :(得分:4)
您描述的问题无法用常规语言表示,因为常规语言无法平衡括号。但是,多年来大多数正则表达式实现都添加了一些功能,这些功能允许解析比常规语言更复杂的语言。特别是,这个问题可以通过.NET的平衡匹配来解决,或者PCRE's recursive expressions(感谢@Gumbo,在评论中指出这一点)。
但是,仅仅因为你可以做某件事并不意味着你应该。使用正则表达式执行此类任务的问题在于,当您扩展元语言时,修改正则表达式会更加难以指定。而解析器往往更具塑性,易于扩展。
所以你可以构造一系列正则表达式来覆盖输入的非病态情况,但为什么甚至可以尝试编写解析器呢?它们易于维护,速度极快(比正则表达式更快),易于扩展,并且启动很有趣。
我最初错过了这个问题正在寻找PHP解决方案,所以我用JavaScript编写了它。我把它翻译成了PHP,并在帖子的最后留下了原始的JavaScript解决方案。
function parse( $s ) {
// we will always have a "current context". the current context is the array we're
// currently operating in. when we start, this is simply an empty array. as new
// arrays are created, this context will change.
$context = array();
// since we have to keep track of how deep our context is, we keep a context stack
$contextStack = array(&$context);
// this accumulates the name of the current array
$name = '';
for( $i=0; $i<strlen($s); $i++ ) {
switch( $s[$i] ) {
case ',':
// if the last array hasn't been added to the current context
// (as will be the case for arrays lacking parens), we add it now
if( $name!='' && !array_key_exists( $name, $context ) )
$context[$name] = array();
// reset name accumulator
$name = '';
break;
case '(':
// we are entering a subcontext
// save a new array in the current context; this will become our new context
$context[$name] = array();
// switch context and add to context stack
$context = &$context[$name];
$contextStack[] = &$context;
// reset name accumulator
$name = '';
break;
case ')':
// we are exiting a context
// if we haven't saved this array in the current context, do so now
if( $name!='' && !array_key_exists( $name, $context ) )
$context[$name] = array();
// we can't just assign $context the return value of array_pop because
// that does not return a reference
array_pop($contextStack);
if( count($contextStack) == 0 ) throw new Exception( 'Unmatched parenthesis' );
$context = &$contextStack[count($contextStack)-1];
// reset name accumulator
$name = '';
break;
default:
// this is part of the field name
$name .= $s[$i];
}
}
// add any trailing arrays to the context (this will cover the case
// where our input ends in an array without parents)
if( $name!='' && !array_key_exists( $name, $context ) )
$context[$name] = array();
if( count( $contextStack ) != 1 ) throw new Exception( 'Unmatched parenthesis' );
return array_pop( $contextStack );
}
function parse(s) {
var root = { parent: null, children: [] };
var field = { parent: root, name: '', start_idx: 0, children: [] };
root.children.push( field );
for( var i=0; i<s.length; i++ ) {
switch( s[i] ) {
case ',':
// if this field didn't have any children, we have to set its text
if( !field.children.length )
field.text = s.substr( field.start_idx, i - field.start_idx + 1 );
// start a new field; create new field and change context
var newfield = { parent: field.parent, name: '', start_idx: i, children:[] };
field.parent.children.push(newfield);
field = newfield;
break;
case '(':
// start of a subfield; create subfield and change context
var subfield = { parent: field, name: '', start_idx: i, children:[] };
field.children.push(subfield);
field = subfield;
break;
case ')':
// end of a subfield; fill out subfield details and change context
if( !field.parent ) throw new Error( 'Unmatched parenthesis!' );
field.text = s.substr( field.start_idx, i - field.start_idx + 1 );
if( field.text==='()' ) {
// empty subfield; pop this subfield so it doesn't clutter the parent
field.parent.children.pop();
}
field = field.parent;
break;
default:
// this is part of the field name
field.name += s[i];
field.name = field.name.trim();
}
}
return root;
}
现在我们已经有了一个解析过的语言树,我们可以很容易地创建一些递归代码来发出你的PHP:
function formatphp_namedarray(arr,indent,lastchild) {
var php = indent + arr.name + ' => array(';
if( arr.children.length ) {
if( arr.children.length===1 && arr.children[0].length===0 ) {
php += arr.children[0].name;
} else {
php += '\n';
indent += '\t';
for( var i=0; i<arr.children.length; i++ )
php += formatphp_namedarray(arr.children[i],indent,i===arr.children.length-1);
indent = indent.replace(/\t$/,'');
php += indent;
}
}
php += (lastchild?')':'),') + '\n';
return php;
}
function formatphp(t) {
var php = 'array(\n';
for( var i=0; i<t.children.length; i++ )
php += formatphp_namedarray( t.children[i], '\t', i===t.children.length-1 );
php += ')'
return php;
}
在此处查看所有内容:http://jsfiddle.net/6bguY/1/
答案 1 :(得分:1)
好的,我觉得这样做的诀窍:
$a = "field1,field2(subfield1),field3(subfield2,subfield3),field4(),field5(subfield4(subsubfield,subsubfield2,limit:50,offset:0))";
$output = array();
$outputStacktrace = array(&$output);
$depth = 0;
$buffer = $key = '';
$m = memory_get_usage();
for ($i = 0; $i < strlen($a); $i++)
if ($a[$i] == ':') {
$key = $buffer;
$buffer = '';
} elseif ($a[$i] == ',') {
if (strlen($buffer))
$outputStacktrace[$depth][$key ? $key : count($outputStacktrace[$depth])] = $buffer;
$buffer = $key = '';
} elseif ($a[$i] == '(') {
$outputStacktrace[$depth][$buffer] = array();
$outputStacktrace[$depth + 1] = &$outputStacktrace[$depth][$buffer];
$depth++;
$buffer = '';
} elseif ($a[$i] == ')') {
if (strlen($buffer))
$outputStacktrace[$depth][$key ? $key : count($outputStacktrace[$depth])] = $buffer;
$buffer = $key = '';
unset($outputStacktrace[$depth]);
$depth--;
} else
$buffer .= $a[$i];
var_dump($output);
所有这一切都在一个循环中。很高兴。非常感谢评论!
编辑包括固定变量名称,修复了最后一次非嵌套字符截断,添加了“特殊”查询元素:limit&amp;偏移量。
<?php
$output = array();
$outputStack = array(&$output);
$depth = 0;
$buffer = $key = '';
$s = count($a);
for ($i = 0; $i < $s; $i++)
if ($a[$i] == ':') {
$key = $buffer;
$buffer = '';
} elseif ($a[$i] == ',' || $a[$i] == ')' || $i == $s - 1) {
if ($depth < 4) {
if ($i == $s - 1)
$buffer .= $a[$i];
if (strlen($buffer))
if (strlen($key))
if (($key == 'limit' || $key == 'offset') && ((int) $buffer) . "" == $buffer)
$outputStack[$depth][$key] = $buffer;
else
$outputStack[$depth]['filters'][$key] = $buffer;
else
$outputStack[$depth]['fields'][] = $buffer;
}
$buffer = $key = '';
if ($a[$i] == ')') {
array_pop($outputStack);
$depth--;
}
} elseif ($a[$i] == '(') {
if ($depth + 1 < 4) {
$outputStack[$depth]['connections'][$buffer] = array('fields' => array('id'));
$outputStack[$depth + 1] = &$outputStack[$depth]['connections'][$buffer];
}
$depth++;
$buffer = '';
} else
$buffer .= $a[$i];
unset($outputStack, $depth, $buffer, $key, $a, $i);
答案 2 :(得分:0)
这是一个解决您问题的简单解析器:
function parse($input) {
preg_match_all('/[,()]|[^,()]+/', $input, $tokens);
// reference to current node list to work with during traversal
$ref = &$output;
// stack to remember node lists
$stack = array();
$output = array();
foreach ($tokens[0] as $token) {
switch ($token) {
case '(':
// push reference to current node list on the stack and
// update reference
$stack[] = &$ref;
$ref = &$ref[$last];
break;
case ')':
// restore previous node list from stack
$ref = &$stack[count($stack)-1];
array_pop($stack);
if (is_null($ref)) echo "error";
break;
case ',':
break;
default:
// insert token into current node list
$ref[$token] = array();
$last = $token;
break;
}
}
if (!empty($stack)) echo "error";
return $ouput
}
$input = 'field1,field2(subfield1),field3(subfield2,subfield3),field4(),field5(subfield4(subsubfield,subsubfield2))';
var_dump(parse($input));
请注意,这只是一个简单的解析器,它没有实现所描述语言的完整有限状态自动机,例如,,
之后的)
不得出现。
但是如果你愿意,你可以添加一些状态机制。只需将每个case
置于其自己的阶段,并检查允许进入新状态的先前状态。