Question

I need to retrieve the untagged content from a string. This is what the input looks like:

<!--[recognized]-->This is a recognized tag<!--[/recognized]-->
<!--[unrecognized]-->This is an unrecognized tag<!--[/unrecognized]-->
and this is normal text

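Given a list of recognized tags, I need a neat and simple way to strip out the "recognized" tags and the normal text, so that only the unrecognized parts remain.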

This is what I'm doing right now, but as you can see I'm using two regular expressions. I would like it to be just one.

// Step 1: remove every recognized tag together with its content.
$recognized_tags    = implode( '|', array( 'recognized', 'foo', 'bar' ) );
$pattern            = '/<!--\[(?<tag>(' . $recognized_tags . '))\]-->(?<tag_content>.*)<!--\[\/\k<tag>\]-->/s';
$partial_result     = preg_replace( $pattern, '', $text );

// Step 2: collect the remaining (unrecognized) tags; the normal text is left out.
preg_match_all( '/<!--\[(?<tag>.+)\]-->(?<tag_content>.*)<!--\[\/\k<tag>\]-->/s', $partial_result, $matches );
$result = implode( $matches[0] );

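So, do you know how I could do this with only one regular expression? Note that the input string may vary and may contain multiple tags (recognized or unrecognized).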

Thanks a lot!

Answer 1

EDIT: to find the content of unrecognized tags: (coming soon)

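OLD ANSWER: to find the text that is not enclosed between tags, you can apply this pattern to the original $text string (without doing any replacement first):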

$text = <<<'LOD'
<!--[recognized]-->This is a recognized tag<!--[/recognized]-->
<!--[unrecognized]-->This is an unrecognized tag<!--[/unrecognized]-->
<!--[atag]-->
    <!--[nested1]--> text
        <!--[nested2]-->text<!--[/nested2]-->
    <!--[/nested1]-->
<!--[/atag]-->
and this is normal text
LOD;

$pattern = '~(<!--\[([^]]++)]-->(?>[^<]++|(?1))*+<!--\[/\2]-->)*+\K[^<]++~';
preg_match_all($pattern, $text, $matches);

print_r($matches[0]);
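
As a rough illustration of what this pattern returns, the sketch below runs it on a shortened input. The $sample string is made up for this example and is not part of the original post; the pattern itself is copied unchanged. Each run of text that directly follows a complete tag block is reported, including runs that consist only of a line break.

```php
<?php
// Hypothetical input: a nested tag block, a second block, then plain text.
$sample = "<!--[a]-->x <!--[b]-->y<!--[/b]--><!--[/a]-->\n"
        . "<!--[c]-->z<!--[/c]-->\n"
        . "plain text";

// Same pattern as in the answer above.
$pattern = '~(<!--\[([^]]++)]-->(?>[^<]++|(?1))*+<!--\[/\2]-->)*+\K[^<]++~';
preg_match_all($pattern, $sample, $matches);

// Two matches are expected: the lone line break after the first block,
// and the line break plus "plain text" after the second block.
var_dump($matches[0]);
```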

Pattern details:

~                       # delimiter
(                       # capturing group 1: will capture all tags with content inside
    <!--\[([^]]++)]-->  # the opening tag: the capturing group 2 contains the name of the tag
    (?>                 # atomic group: all possible content inside tags 
        [^<]++          # all characters except <
      |                 # OR
        (?1)            # another tag: recursion into capturing group 1
    )*+                 # repeat zero or more times the atomic group
    <!--\[/\2]-->       # the closing tag with a backreference to the 2nd capturing group
)*+                     # repeat zero or more times the capturing group 1
\K                      # IMPORTANT: \K discards everything matched before it from the reported match
[^<]++                  # the result: all characters that are not a <
~

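The general idea of this pattern is to match all potential tags that precede the "normal text", then use the \K feature to drop that part from the match result.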
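
To see the effect of \K in isolation, here is a minimal sketch on an unrelated, made-up string (the 'color=red' input and the variable names are assumptions for illustration only). The prefix still has to match, but it is left out of what preg_match reports.

```php
<?php
// Without \K the whole "key=value" pair is reported.
preg_match('~\w+=\w+~', 'color=red', $whole);

// With \K the "key=" part must still match, but it is dropped from the result.
preg_match('~\w+=\K\w+~', 'color=red', $kept);

echo $whole[0], "\n"; // color=red
echo $kept[0], "\n";  // red
```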

Note: to avoid blank results and to trim leading whitespace, you can add this to the pattern:

$pattern = '~(?>\s++|(<!--\[([^]]++)]-->(?>[^<]++|(?1))*+<!--\[/\2]-->))*+\K[^<]++~';
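
For the goal stated in the question (keeping only the unrecognized tags), one possible single-pattern sketch uses a negative lookahead to reject recognized tag names. This is not the answerer's method, only an illustration under stated assumptions: tags are not nested, tag names contain neither "]" nor "/", and the names $recognized and $alternation are introduced here for the example.

```php
<?php
$text = <<<'LOD'
<!--[recognized]-->This is a recognized tag<!--[/recognized]-->
<!--[unrecognized]-->This is an unrecognized tag<!--[/unrecognized]-->
and this is normal text
LOD;

// Assumed list of recognized tag names, escaped for use inside the pattern.
$recognized  = array('recognized', 'foo', 'bar');
$alternation = implode('|', array_map(function ($name) {
    return preg_quote($name, '~');
}, $recognized));

// (?!(?:...)\]) refuses an opening tag whose full name is in the list;
// .*? stops at the first closing tag carrying the same name.
$pattern = '~<!--\[(?!(?:' . $alternation . ')\])(?<tag>[^\]/]+)\]-->'
         . '(?<tag_content>.*?)<!--\[/\k<tag>\]-->~s';

preg_match_all($pattern, $text, $matches);
$result = implode($matches[0]);

echo $result, "\n";
```

Because the lookahead ends with \], a tag such as "foobar" would still count as unrecognized even though it starts with a recognized name. Nested tags would need a recursive pattern along the lines of the one described above.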
