Question

我一直在努力为我正在开发的HTML标记语言创建基本文本的解析器。内联元素标记如下。

{*strong*}
{/emphasis/}
{-strikethrough-}
{>small<}
{|code|}

我正在测试的示例字符串是：

tëstïng 汉字/漢字 testing {*strông{/ëmphäsïs{-strïkë{|côdë|}-}/}*} {*wôw*} 1, 2, 3

使用preg_split我可以将其转换为：

$split = preg_split('%(\{.(?:[^{}]+|(?R))+.\})%',
    $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

array (size=5)
  0 => string 'tëstïng 汉字/漢字 testing ' (length=32)
  1 => string '{*strông{/ëmphäsïs{-strïkë{|côdë|}-}/}*}' (length=48)
  2 => string ' ' (length=1)
  3 => string '{*wôw*}' (length=8)
  4 => string ' 1, 2, 3' (length=8)

然后循环浏览$dom->createTextNode()或$dom->createElement() + $dom->appendChild($dom->createTextNode())。不幸的是，当嵌套标记时，这没有用。

我只是以一种有效的方式将我的标记递归处理到DOMDocument中。我一直在阅读，我需要编写一个解析器，但找不到合适的教程或代码示例，特别是在使用DOMDocument将其与元素和文本节点创建集成时。

Answer 1

嵌套或递归结构通常超出了正则表达式解析的能力，你通常需要一个更强大的解析器。问题是你需要根据以前的令牌找到下一个令牌，这不是正则表达式可以处理的（语言不再是常规的）。

然而，对于这样一种简单的语言，你不需要一个带有正式语法的完整的解析器生成器 - 你可以轻松地手工编写一个简单的解析器。你只有一点重要的状态 - 最后打开的标签。如果您的正则表达式与文本，新的打开标记或当前打开的标记的相应关闭标记匹配，则可以处理此任务。规则是：

如果您匹配文字，请保存文字并继续匹配。
如果您匹配开放标记，请保存打开的标记，然后继续匹配，直到找到开放标记或相应的关闭标记。
如果您匹配关闭标记，请停止查找当前打开的标记，并继续匹配最后未公开的标记，文本或其他打开标记。

第二步是递归的 - 每当你找到一个新的开放标记时，你就会创建一个新的匹配上下文来查找相应的关闭标记。

这不是必需的，但通常解析器将生成一个简单的树结构来表示已解析的文本 - 这称为抽象语法树。在生成语法表示的内容之前，通常最好先生成语法树。这使您可以灵活地操作树或生成不同的输出（例如，您可以输出除xml之外的其他内容。）

这是一个结合了这些想法并解析您的文本的解决方案。（它还将{{或}}视为转义序列，表示单个文字{或}。）

首先是解析器：

class ParseError extends RuntimeException {}

function str_to_ast($s, $offset=0, $ast=array(), $opentag=null) {
    if ($opentag) {
        $qot = preg_quote($opentag, '%');
        $re_text_suppl = '[^{'.$qot.']|{{|'.$qot.'[^}]';
        $re_closetag = '|(?<closetag>'.$qot.'\})';
    } else {
        $re_text_suppl = '[^{]|{{';
        $re_closetag = '';
    }
    $re_next = '%
        (?:\{(?P<opentag>[^{\s]))  # match an open tag
              #which is "{" followed by anything other than whitespace or another "{"
        '.$re_closetag.'  # if we have an open tag, match the corresponding close tag, e.g. "-}"
        |(?P<text>(?:'.$re_text_suppl.')+) # match text
            # we allow non-matching close tags to act as text (no escape required)
            # you can change this to produce a parseError instead
        %ux';
    while ($offset < strlen($s)) {
        if (preg_match($re_next, $s, $m, PREG_OFFSET_CAPTURE, $offset)) {
            list($totalmatch, $offset) = $m[0];
            $offset += strlen($totalmatch);
            unset($totalmatch);
            if (isset($m['opentag']) && $m['opentag'][1] !== -1) {
                list($newopen, $_) = $m['opentag'];
                list($subast, $offset) = str_to_ast($s, $offset, array(), $newopen);
                $ast[] = array($newopen, $subast);
            } else if (isset($m['text']) && $m['text'][1] !== -1) {
                list($text, $_) = $m['text'];
                $ast[] = array(null, $text);
            } else if ($opentag && isset($m['closetag']) && $m['closetag'][1] !== -1) {
                return array($ast, $offset);
            } else {
                throw new ParseError("Bug in parser!");
            }
        } else {
            throw new ParseError("Could not parse past offset: $offset");
        }
    }
    return array($ast, $offset);
}

function parse($s) {
    list($ast, $offset) = str_to_ast($s);
    return $ast;
}

这将生成一个抽象语法树，它是一个“节点”列表，其中每个节点都是array(null, $string)形式的数组，用于文本或array('-', array(...))（即类型代码和另一个列表标签内的东西。

一旦你拥有了这棵树，你可以随心所欲地做任何事情。例如，我们可以递归遍历它以生成DOM树：

function ast_to_dom($ast, DOMNode $n = null) {
    if ($n === null) {
        $dd = new DOMDocument('1.0', 'utf-8');
        $dd->xmlStandalone = true;
        $n = $dd->createDocumentFragment();
    } else {
        $dd = $n->ownerDocument;
    }
    // Map of type codes to element names
    $typemap = array(
        '*' => 'strong',
        '/' => 'em',
        '-' => 's',
        '>' => 'small',
        '|' => 'code',
    );

    foreach ($ast as $astnode) {
        list($type, $data) = $astnode;
        if ($type===null) {
            $n->appendChild($dd->createTextNode($data));
        } else {
            $n->appendChild(ast_to_dom($data, $dd->createElement($typemap[$type])));
        }
    }
    return $n;
}

function ast_to_doc($ast) {
    $doc = new DOMDocument('1.0', 'utf-8');
    $doc->xmlStandalone = true;
    $root = $doc->createElement('body');
    $doc->appendChild($root);
    ast_to_dom($ast, $root);
    return $doc;
}

以下是一些测试代码更加困难的测试代码：

$sample = "tëstïng 汉字/漢字 {{ testing -} {*strông 
    {/ëmphäsïs {-strïkë *}also strike-}/} also {|côdë|}
    strong *} {*wôw*} 1, 2, 3";
$ast = parse($sample);
echo ast_to_doc($ast)->saveXML();

这将打印以下内容：

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<body>tëstïng 汉字/漢字 {{ testing -} <strong>strông 
    <em>ëmphäsïs <s>strïkë *}also strike</s></em> also <code>côdë</code>
    strong </strong> <strong>wôw</strong> 1, 2, 3</body>

如果您已经有DOMDocument，并且想要为其添加一些已解析的文字，我建议您创建一个DOMDocumentFragment并将其直接传递给ast_to_dom，然后将其附加到您想要的容器元素。

Answer 2

如果你有一个捕获最外面的打开/关闭对的内容的正则表达式，那么你可以将捕获的内容包装在等效的HTML标记中，然后通过重复相同的正则表达式来递归到新的字符串（这将捕获第二到最外对的内容，等等。

这种方法的问题在于，如果/当一个开放的“标签”没有正确关闭时，整个内容就会丢失，然后就无法递归到它中。

更可靠的方法可能是从头到尾解析文本，当遇到开始标记时，将其及其位置添加到堆栈中。每当遇到结束标记时，如果它与栈顶部的开始标记不匹配，或者如果它匹配，则将其忽略，然后将当前结束标记替换为等效的HTML结束标记，并弹出开始标记。堆栈（并将其替换为记录位置的等效开始HTML标记）。

一个简单的解析算法可能是找到你的开始或结束标签的第一个实例（例如使用这个正则表达式(\{[-*/>|])|(\}[-*/<|])），然后如上所述进行处理，然后从当前位置重复搜索以找到下一个标签等...

使用正则表达式和DOMDocument递归处理标记

2 个答案: