Question

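I have actually already found a working solution, and its name is regex. Yes, I know, it has been said countless times that you should not use regular expressions to parse HTML. But as the title says, this depends on the inner HTML text, which has to follow a certain pattern. So I need a regular expression anyway! I tried using a DOM library first, but I failed.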

So my actual question is: is there a best practice for this problem? Anyway, here is what I have:

Before

HTML:

<section> 
    {foo:bar}
</section>

PHP:

// I'm not a regex ninja, but this seems to do the job

$regexTag = "/<(?!body|head|html|link|script|\!|\/)(\w*)[^>]*>[^{]*{\s*[^>]*:\s*[^>]*\s*[^}]}/";
// $match[0] "<section> {foo:bar}"
// $match[1] "section"


preg_match_all($regexTag,$html, $match); 


for ($i=0; $i < sizeof($match[0]); $i++) { 
    $pos = (strlen($match[1][$i])+1);
    $str = substr_replace($match[0][$i], " class='foo'", $pos, 0);
    $html = str_replace($match[0][$i], $str, $html);
}

HTML after:

<section class='foo'> 
    {foo:bar}
</section>

Answer 1

Regex is not the right tool for this job. Stick with the DOM parser approach. Here is a quick solution using the DOMDocument class.

Use getElementsByTagName('*') to get all the tags, then use in_array() to check whether the tag name is in the list of disallowed tags.

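Then use a regular expression with preg_match() to check whether the text content follows the {foo:bar} pattern. If it does, add the new attributes one by one with the setAttribute() method: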

// An array containing all attributes
$attrs = [
    'class' => 'foo'
    /* more attributes & values */
];

$ignored_tags = ['body', 'head', 'html', 'link', 'script'];

$dom = new DOMDocument;
$dom->loadXML($html);

foreach ($dom->getElementsByTagName('*') as $tag) 
{
    // If not a disallowed tag
    if (!in_array($tag->tagName, $ignored_tags)) 
    {
        $textContent = trim($tag->textContent);

        // If $textContent matches the format '{foo:bar}'
        if (preg_match('#{\s*[^>]*:\s*[^>]*\s*[^}]}#', $textContent)) 
        {
            foreach ($attrs as $attr => $val)
                $tag->setAttribute($attr, $val);
        }
    }
}

echo $dom->saveHTML();

Output:

<section class="foo"> 
    {foo:bar}
</section>
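
One thing to watch for with the loop above: textContent also includes the text of nested elements, so a wrapper element around the section would match the pattern as well. A minimal sketch of one way to restrict the check to an element's own text nodes follows. The helper name ownText() and the sample markup are assumptions made for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the answer above): returns only the text that
// sits directly inside $element, skipping the text of nested elements.
function ownText(DOMElement $element): string
{
    $text = '';
    foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text .= $child->nodeValue;
        }
    }
    return trim($text);
}

// Assumed sample input: a wrapper <div> around the <section> from the question.
$html = '<div><section> {foo:bar} </section><p>plain text</p></div>';

$dom = new DOMDocument;
$dom->loadXML($html);

foreach ($dom->getElementsByTagName('*') as $tag) {
    // Same pattern as in the answer, applied to the element's own text only
    if (preg_match('#{\s*[^>]*:\s*[^>]*\s*[^}]}#', ownText($tag))) {
        $tag->setAttribute('class', 'foo');
    }
}

$result = $dom->saveHTML();
echo $result;
```

With this check the wrapper div has no text nodes of its own, so only the section that directly contains the pattern should receive the attribute.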

Answer 2

So this works:

$elements = $dom->getElementsByTagName('body')->item(0)->childNodes;

for ($i = $elements->length-1; $i >= 0; $i--) { 
   $element = $elements->item($i); 
   $tag =  $element->nodeName;

   foreach ($dom->getElementsByTagName($tag) as $tag) {
       ...

I don't know, I still feel more comfortable with regex, haha. But I guess this is the way to go.
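
For a complete page that has a body element, as in the snippet above, the markup is often not well-formed XML, so loadXML() may reject it. A rough sketch of the same idea using loadHTML() is shown here. The sample page is an assumption, and the tag filter and pattern are borrowed from Answer 1 rather than from the elided loop above.

```php
<?php
// Assumed sample page; the real $html would come from the asker's application.
$html = '<html><head><title>t</title></head><body>'
      . '<section> {foo:bar} </section><p>no pattern here</p>'
      . '</body></html>';

$ignored_tags = ['body', 'head', 'html', 'link', 'script'];

// libxml may report HTML5 tags such as <section> as errors; collect them quietly
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('*') as $tag) {
    // Skip the structural tags listed in Answer 1
    if (in_array($tag->tagName, $ignored_tags)) {
        continue;
    }
    // Pattern taken from Answer 1
    if (preg_match('#{\s*[^>]*:\s*[^>]*\s*[^}]}#', trim($tag->textContent))) {
        $tag->setAttribute('class', 'foo');
    }
}

$result = $dom->saveHTML();
echo $result;
```

Because getElementsByTagName('*') already visits every element, this version does not need the outer loop over the body's child nodes.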

Add attributes to HTML tags based on an inner pattern using PHP
