PHP - Stripping comments and redundant whitespace - best practice

Date: 2015-05-28 10:04:32

Tags: php regex html-parsing domdocument

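I want to use PHP to remove all comments and redundant whitespace (including line breaks) from an HTML document.

I tried regular expressions, but they do not seem well suited to parsing HTML documents. I also tried DOMDocument, but it strips IE conditional comments as well, which is definitely not wanted. In addition, it does not remove line breaks or JavaScript comments, and the output does not seem to include the doctype.

The goal is to reduce the HTML document to the minimum number of bytes it needs to still parse correctly.

My current approaches are as follows:

Using regular expressions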

# Works quite well, but would also strip strings that look like comments.
$newHtml = preg_replace('/<!--\s*(?!\[\s*if\s|<\s*!\s*\[\s*endif\s*\]).*?-->/is', '', $oldHtml);

# Works, but would also strip intended whitespaces within <pre> elements
$newHtml = preg_replace('/\s+/', ' ', $oldHtml);

# Has one major side effect: JavaScript comments with double slashes (//)
# will lead to the rest of the script being commented as well.
$newHtml = preg_replace('/\r|\n/', '', $oldHtml);
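
One way around the <pre> and // problems noted in the comments above might be to match whole <pre>, <textarea>, <script> and <style> elements first and leave them untouched, collapsing whitespace only in the rest of the markup. The following is a minimal sketch under that assumption, not a tested solution. The function name collapseWhitespaceOutsideProtected is hypothetical, and because it is still regex-based it can be fooled by a ">" inside an attribute value or by a literal closing tag inside a script string.

```php
<?php
// Hypothetical helper (not part of the original post): collapse runs of
// whitespace to a single space, except inside elements whose whitespace
// or line breaks are significant.
function collapseWhitespaceOutsideProtected(string $html): string
{
    $result = preg_replace_callback(
        // First alternative: a complete protected element, opening tag
        // through matching closing tag. Second alternative: a whitespace run.
        '~<(pre|textarea|script|style)\b[^>]*>.*?</\1\s*>|\s+~is',
        static function (array $m): string {
            // $m[1] is only populated when the protected-element branch
            // matched; in that case return the element unchanged.
            return !empty($m[1]) ? $m[0] : ' ';
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error (for example
    // when the backtrack limit is hit); fall back to the input then.
    return $result ?? $html;
}
```

Because script bodies keep their line breaks, a // comment no longer swallows the code that follows it. The trade-off is that nothing inside the protected elements is minified at all.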

Using DOMDocument

$doc   = new DOMDocument('1.0', 'UTF-8'); # First argument is the XML version, not the HTML version.
$doc->loadHTML($oldHtml);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//comment()') as $comment) {
    # Also strips conditional comments for IE... uncool.
    $comment->parentNode->removeChild($comment);
}
$newHtml  = '<!DOCTYPE html>'; # Do I really need to do this manually?
$newHtml .= $doc->saveHTML($xpath->query('//html')->item(0));
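
A possible variation on the DOMDocument attempt above: inspect each comment's text and skip those that look like IE conditional comments, then serialize the whole document rather than only the html node, so that the doctype from the input is written out as well. This is a sketch under stated assumptions, not a confirmed answer. The function name stripNonConditionalComments and the pattern used to recognise conditional comments are illustrative, and loadHTML() may still mis-detect the encoding if the markup has no charset declaration.

```php
<?php
// Hypothetical helper (not part of the original post): remove ordinary
// HTML comments but keep IE conditional comments and the doctype.
function stripNonConditionalComments(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them
    // (libxml complains about elements it does not know).
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Copying the result into an array is a defensive step so the loop
    // does not depend on how the node list behaves during removal.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        // Assumed heuristic: conditional comments start with "[if " or
        // consist of the "<![endif]" closer.
        if (preg_match('/^\s*(\[if\s|<!\[endif\])/i', $comment->nodeValue)) {
            continue;
        }
        $comment->parentNode->removeChild($comment);
    }

    // Called without a node argument, saveHTML() serializes the whole
    // document, including its doctype node.
    return $doc->saveHTML();
}
```

This leaves whitespace and JavaScript comments alone; those would still need a separate pass.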
