Question

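I've stumbled upon a weird error with PHP's preg_replace function and some regex patterns. What I'm trying to do is replace custom tags delimited by brackets and convert them to HTML. The regex has to account for custom "fill" tags that stay in the output HTML so they can be replaced on the fly when the page loads (for example, with the site name).

Each regex pattern works on its own, but for some reason some of them exit the function early if one of the other patterns is checked first. When I stumbled upon this, I switched to preg_match inside a foreach loop, checking each pattern in turn and returning the result as soon as one matches, so that hypothetically each pattern would see a fresh copy of the string.

This didn't work either.

Here is the code: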

function replaceLTags($originalString){
    $patterns = array(
                '#^\[l\]([^\s]+)\[/l\]$#i' => '<a href="$1">$1</a>',
                '#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i'=> '<a href="$1">$2</a>',
                '#^\[l=([^\s]+) title=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" title="$2">$3</a>',
                '#^\[l=([^\s]+) rel=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" rel="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2">$3</a>',
                '#^\[l=([^\s]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" style="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+)] target=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" rel="$3" target="$4">$5</a>'
            );

    foreach ($patterns as $pattern => $replace){
        if (preg_match($pattern, $originalString)){
            return preg_replace($pattern, $replace, $originalString);
        }
    }
}

$string = '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]';

echo $alteredString = $format->replaceLTags($string);

The string above comes out as:

<a href="[site_url">/site-category/ class=hello rel=nofollow target=_blank]Hello there</a>

When it should come out as:

<a href="[site_url]/site-category/" class="hello" rel="nofollow" target="_blank">Hello there</a>

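But if I move that pattern further up the list so it gets checked sooner, the string is formatted correctly.

I'm stumped, because it looks as if the string is being overwritten every time it is checked, even though that makes no sense.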

Answer 1

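It seems to me you're doing a lot more work than you need to. Instead of using a separate regex/replacement pair for every possible list of attributes, why not use preg_replace_callback and process the attributes in a separate step? For example: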

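function replaceLTags($originalString){
  // The callback name is passed as a quoted string; a bare word is an undefined constant in PHP 8.
  return preg_replace_callback('#\[l=((?>[^\s\[\]]+|\[site_url\])+)([^\]]*)\](.*?)\[/l\]#',
                               'replaceWithinTags', $originalString);
}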

function replaceWithinTags($groups){
  return '<a href="' . $groups[1] . '"' . 
         preg_replace('#(\s+\w+)=(\S+)#', '$1="$2"', $groups[2]) .
         '>' . $groups[3] . '</a>';
}

See a complete demo here (updated; see the comments).

Here is an updated version of the code, based on new information provided in the comments:

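function replaceLTags($originalString){
  // The callback name is passed as a quoted string; a bare word is an undefined constant in PHP 8.
  return preg_replace_callback('#\[l=((?>[^\s\[\]]+|\[\w+\])+)([^\]]*)\](.*?)\[/l\]#',
                               'replaceWithinTags', $originalString);
}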

function replaceWithinTags($groups){
  return '<a href="' . $groups[1] . '"' . 
         preg_replace(
             '#(\s+[^\s=]+)\s*=\s*([^\s=]+(?>\s+[^\s=]+)*(?!\s*=))#',
             '$1="$2"', $groups[2]) .
         '>' . $groups[3] . '</a>';
}

demo

In the first regex I changed [site_url] to \[\w+\] so that it can match any custom fill tag.

Here is a breakdown of the second regex:
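As a quick check of that change, here is a minimal sketch that runs the updated first regex against a link whose URL starts with a different fill tag. The `[base_path]` tag and the sample link are made up for illustration; they do not come from the question.

```php
<?php
// Updated first regex from the answer: \[\w+\] accepts any bracketed fill tag.
$pattern = '#\[l=((?>[^\s\[\]]+|\[\w+\])+)([^\]]*)\](.*?)\[/l\]#';

// Hypothetical input: [base_path] is an assumed fill tag, not one from the question.
$input = '[l=[base_path]/docs title=Read the docs]Docs[/l]';

$matched = preg_match($pattern, $input, $m);

// $m[1] is the URL (fill tag included), $m[2] the raw attribute text, $m[3] the link text.
print_r($m);
```

The URL group stops at the first whitespace, so everything between the URL and the closing bracket lands in the second group for the attribute step to handle.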

(\s+[^\s=]+)   # the attribute name and its leading whitespace
\s*=\s*
(
  [^\s=]+   # the first word of the attribute value
  (?>\s+[^\s=]+)*  # the second and subsequent words, if any
  (?!\s*=)  # prevents the group above from consuming attribute names
)

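The only tricky part is matching multi-word attribute values. (?>\s+[^\s=]+)* will always consume the next attribute name if there is one, but the lookahead forces it to backtrack. Normally it would back off one character at a time, but the atomic group effectively forces it to backtrack by whole words or not at all.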
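To see the backtracking behaviour on a multi-word value, here is a small sketch that applies only the attribute regex. The helper name `quoteAttributes` and the sample attribute text are assumptions made for this illustration.

```php
<?php
// Hypothetical helper wrapping the attribute regex from the answer above.
function quoteAttributes($attributeText) {
    return preg_replace(
        '#(\s+[^\s=]+)\s*=\s*([^\s=]+(?>\s+[^\s=]+)*(?!\s*=))#',
        '$1="$2"',
        $attributeText
    );
}

// Assumed sample input: a two-word title followed by a second attribute.
$quoted = quoteAttributes(' title=Hello there rel=nofollow');

echo $quoted, "\n";
```

The value group first swallows " there rel", the lookahead then sees the "=" after "rel" and rejects it, and the quantifier gives back exactly one whole word, leaving "Hello there" as the title value.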

Answer 2

You've messed up the regexes. If you print the string on each iteration, like this:

foreach ($patterns as $pattern => $replace){
    echo "String: $originalString\n";
    if (preg_match($pattern, $originalString)){
        return preg_replace($pattern, $replace, $originalString);
    }
}

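you'll see that the string is never modified. From my run, I noticed that it is the second regex that matches. I passed a third parameter to the preg_match call and printed the matches. This is what I got: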

Array (
    [0] => [l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]
    [1] => [site_url
    [2] => /site-category/ class=hello rel=nofollow target=_blank]Hello there )
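The check described above can be reproduced with a few lines. This is a minimal sketch that uses the second pattern and the test string from the question; the variable names are chosen for illustration.

```php
<?php
// Second pattern from the question's array: the generic [l=url]text[/l] form.
$pattern = '#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i';
$string  = '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]';

// The third argument makes preg_match fill $matches with the captured groups.
$matched = preg_match($pattern, $string, $matches);

print_r($matches);
```

The first group backtracks to the "]" that closes [site_url], which is why the href in the output ends at "[site_url" and the rest of the tag ends up in the link text.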

Answer 3

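There are two causes of the immediate problem:

First, there is a typo in the applicable regex (the last one in the array). It has an extraneous literal right square bracket before " target=". In other words, this: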

'#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+)] target=([^\[]+)]([^\[]+)\[/l\]$#i'

should read:

'#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+) target=([^\[]+)]([^\[]+)\[/l\]$#i'

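Second, two regexes in the array both match the same string, and unfortunately the more specific of the two (the one above, which is the one we want) comes later in the array. The other, more general regex that matches is the second one in the array: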

'#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i'
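Both causes can be checked directly. The sketch below runs the pattern with the typo, the corrected pattern, and the general pattern against the question's test string; the variable names are made up for this example.

```php
<?php
$string = '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]';

// Last pattern as posted in the question, with the stray "]" before " target=".
$withTypo = '#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+)] target=([^\[]+)]([^\[]+)\[/l\]$#i';
// The same pattern with the stray bracket removed.
$specific = '#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+) target=([^\[]+)]([^\[]+)\[/l\]$#i';
// The general pattern that sits second in the array.
$general  = '#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i';

$typoResult     = preg_match($withTypo, $string);  // 0: the typo prevents any match
$specificResult = preg_match($specific, $string);  // 1: the corrected pattern matches
$generalResult  = preg_match($general, $string);   // 1: the general pattern matches too

var_dump($typoResult, $specificResult, $generalResult);
```

Because the loop returns on the first match, whichever of the two matching patterns comes first in the array decides the output.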

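Putting the more general regex last and removing the extraneous square bracket solves the problem. Here is your original code, fixed with the two changes above: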

function replaceLTags($originalString){
    $patterns = array(
                '#^\[l\]([^\s]+)\[/l\]$#i' => '<a href="$1">$1</a>',
                '#^\[l=([^\s]+) title=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" title="$2">$3</a>',
                '#^\[l=([^\s]+) rel=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" rel="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2">$3</a>',
                '#^\[l=([^\s]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" style="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+) target=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" rel="$3" target="$4">$5</a>',
                '#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i'=> '<a href="$1">$2</a>'
            );

    foreach ($patterns as $pattern => $replace){
        if (preg_match($pattern, $originalString)){
            return preg_replace($pattern, $replace, $originalString);
        }
    }
}

$string = '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]';

echo $alteredString = $format->replaceLTags($string);

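Note that this only fixes the immediate, specific error described in your question and does not address the more fundamental problems with what you are trying to accomplish. I have presented a somewhat better solution as an answer to your follow-up question: How do I make this REGEX ignore = in a tag's attribute?.

But as others have mentioned, mixing two different markup languages together and processing them with regex is asking for trouble.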

Answer 4

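Here is some more generic code that gets by with fewer expressions. You can always remove any tags that are not allowed from the final string.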

<?php

function replaceLTags($originalString) {
    if (preg_match('#^\[l\]([^\s]+)\[/l\]$#i', $originalString)) {
        // match a link with no description or tags
        return preg_replace('#^\[l\]([^\s]+)\[/l\]$#i', '<a href="$1">$1</a>', $originalString);
    } else if (preg_match('#^\[l=([^\s]+)\s*([^\]]*)\](.*?)\[/l\]#i', $originalString, $matches)) {
        // match a link with title and/or tags
        $attribs = $matches[2];
        $attrStr = '';
        if (preg_match_all('#([^=]+)=([^\s\]]+)#i', $attribs, $attribMatches) > 0) {
            $attrStr = ' ';
            for ($i = 0; $i < sizeof($attribMatches[0]); ++$i) {
                $attrStr .= $attribMatches[1][$i] . '="' . $attribMatches[2][$i] . '" ';
            }
            $attrStr = rtrim($attrStr);
        }

        return '<a href="' . $matches[1] . '"' . $attrStr . '>' . $matches[3] . '</a>';
    } else {
        return $originalString;
    }
}

$strings = array(
    '[l]http://www.stackoverflow.com[/l]',
    '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]',
    '[l=[site_url]/page.php?q=123]Link[/l]',
    '[l=http://www.stackoverflow.com/careers/ target=_blank class=default]Stack overflow[/l]'
);

foreach($strings as $string) {
    $altered = replaceLTags($string);
    echo "{$altered}<br />\n";
}

PHP preg_replace returns the wrong result depending on the order in which patterns are checked
