Question

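I have read in many places that regular expressions are not the wisest way to fetch and manipulate HTML, and that you should use DOMDocument instead. I have refactored some code from the documentation and from here, and created two functions that split the_content() into text and tags. The first function removes a specific tag and returns the content without it; the second returns only the content of that tag and nothing else.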

function get_content_without( $html, $tag )
{
    $dom = new DOMDocument;
    $dom->loadHTML( $html );

    $dom_x_path = new DOMXPath( $dom );
    while ($node = $dom_x_path->query( $tag )->item(0)) {
        $node->parentNode->removeChild( $node );
    }
    return $dom->saveHTML();
}

function get_html_tag_content( $html, $tag )
{
    $document = new DOMDocument();
    $document->loadHTML( $html );  

    $tags = [];
    $elements = $document->getElementsByTagName( $tag );
    if ( $elements ) {
        foreach ( $elements as $element ) {
            $tags[] = $document->saveHtml($element);
        }   
    }   
    return $tags;
}

Proof of concept (here we separate the text from the a tag):

$html = '<a href="http://localhost/wordpress/image3/tags-sidebar/" rel="attachment wp-att-731">
        <img src="http://localhost/wordpress/wp-content/uploads/2014/12/tags-sidebar.jpg" alt="tags sidebar" width="318" height="792" class="alignright size-full wp-image-731" />
    </a>
    Cras malesuada turpis et augue feugiat, eget mollis tellus elementum. 
    Nunc posuere mattis arcu, ut varius ipsum molestie in. 
    Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; 
    Morbi ultricies tincidunt odio nec suscipit. Sed porttitor metus ut tincidunt interdum. 
    Etiam lobortis mollis augue at aliquam. Nunc venenatis elementum quam sed elementum. 
    Pellentesque congue pellentesque orci, vel convallis augue semper vitae';

?><pre><?php var_dump(get_html_tag_content($html, 'a')); ?></pre><?php  
?><pre><?php var_dump(get_content_without($html, '//a')); ?></pre><?php

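My question is whether there is something similar for matching and removing shortcodes in WordPress. The function built into WordPress is really bad: it matches all shortcodes.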

I have found many examples that use regular expressions, but none that use DOM. Here are two example shortcodes:

[audio mp3="http://localhost/wordpress/wp-content/uploads/2014/09/Aha-The-Sun-Always-Shines-On-TV.mp3"][/audio]
[gallery ids="734,731,725,721"]

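How do I match the audio shortcode, and how do I match the gallery shortcode? Is this possible without regular expressions, using DOM instead, and if so, how?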

Answer 1

It is not possible to isolate shortcodes using DOM alone.

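The characters [ and ] have no special meaning in HTML or XML. To a DOM parser, [shortcode] is therefore no different from the ipsum in your example text above. It is just another part of a text node, so the only way to find shortcodes is with string functions, for example regular expressions.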
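To illustrate the point, here is a minimal sketch showing that the DOM parser finds no element for either shortcode, while a string function does locate them. The helper split_shortcode() is hypothetical (it is not a WordPress function), and its pattern is deliberately simplified: it assumes attribute values contain no "]" and that shortcodes are not nested.

```php
<?php
// Hypothetical helper (not a WordPress function): returns the shortcodes of one
// tag plus the content with those shortcodes removed.
function split_shortcode(string $content, string $tag): array
{
    $name = preg_quote($tag, '/');
    // Simplified pattern: [tag ...] optionally followed by ...[/tag].
    // Assumes attribute values contain no "]" and shortcodes are not nested.
    $pattern = '/\[' . $name . '(?![\w-])[^\]]*\](?:.*?\[\/' . $name . '\])?/s';

    preg_match_all($pattern, $content, $matches);

    return [
        'shortcodes' => $matches[0],
        'content'    => preg_replace($pattern, '', $content),
    ];
}

$content = '<p>Lorem [audio mp3="song.mp3"][/audio] ipsum [gallery ids="734,731,725,721"]</p>';

// The DOM parser sees no <audio> or <gallery> element: the shortcodes are plain text.
$dom = new DOMDocument();
$dom->loadHTML($content);
$elementCount = $dom->getElementsByTagName('gallery')->length
              + $dom->getElementsByTagName('audio')->length;

$audio   = split_shortcode($content, 'audio');
$gallery = split_shortcode($content, 'gallery');

var_dump($elementCount, $audio['shortcodes'], $gallery['content']);
```

Unlike a function that strips every shortcode, this targets one tag name at a time, which mirrors the pair of functions in the question. Inside WordPress, get_shortcode_regex() is probably a more robust source for the pattern than the simplified one above.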

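Shadow DOM is an emerging standard for what are essentially native HTML shortcodes. As of today, native support is spotty. If you want to replace your shortcodes with something a DOM parser can handle, that is the way to go.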
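As a rough sketch of what "something a DOM parser can handle" could look like, the code below rewrites shortcodes into custom-named elements and then queries them with DOMDocument. Everything here is an assumption for illustration: shortcodes_to_elements() is a hypothetical helper, the "x-" prefix is an arbitrary naming choice, and WordPress itself will not render these elements. Note that the conversion step still relies on a string function.

```php
<?php
// Hypothetical helper: rewrite [tag attrs]...[/tag] (or a lone [tag attrs])
// into <x-tag attrs>...</x-tag> so a DOM parser sees an element.
// Assumes attribute values contain no "]" and shortcodes are not nested.
function shortcodes_to_elements(string $content, array $tags): string
{
    foreach ($tags as $tag) {
        $name    = preg_quote($tag, '/');
        $pattern = '/\[' . $name . '(?![\w-])([^\]]*)\](?:(.*?)\[\/' . $name . '\])?/s';

        $content = preg_replace_callback(
            $pattern,
            function (array $m) use ($tag): string {
                $inner = $m[2] ?? ''; // absent for self-closing shortcodes
                return '<x-' . $tag . $m[1] . '>' . $inner . '</x-' . $tag . '>';
            },
            $content
        );
    }
    return $content;
}

$content = '<p>Lorem [audio mp3="song.mp3"][/audio] ipsum [gallery ids="734,731,725,721"]</p>';

// Unknown tag names may make libxml emit warnings, so collect them silently.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML(shortcodes_to_elements($content, ['audio', 'gallery']));
libxml_clear_errors();

$gallery = $dom->getElementsByTagName('x-gallery')->item(0);
var_dump($gallery->getAttribute('ids'));
```

Once the content is in this form, the two functions from the question should work on it with 'x-gallery' or '//x-audio' as the tag argument.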
