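I want to remove all comments and redundant whitespace (including line breaks) from an HTML document using PHP.
I tried regular expressions, but they don't seem well suited to parsing HTML documents. I also tried DOMDocument, but my approach strips IE conditional comments as well, which is definitely not wanted. In addition, it doesn't remove line breaks or JavaScript comments, and it doesn't seem to include the doctype.
The goal is to keep only the minimum number of bytes the HTML document needs in order to parse correctly.
My current approaches are as follows:
Using regular expressions: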
# Works quite well, but would also strip strings that look like comments.
$newHtml = preg_replace('/<!--\s*(?!\[\s*if\s|<\s*!\s*\[\s*endif\s*\]).*?-->/is', '', $oldHtml);
# Works, but would also collapse intended whitespace within <pre> elements
$newHtml = preg_replace('/\s+/', ' ', $oldHtml);
# Has one major side effect: once the line breaks are gone, a JavaScript
# line comment (//) comments out the rest of the script as well.
$newHtml = preg_replace('/\r|\n/', '', $oldHtml);
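
One way to work around the `<pre>` and `//` side effects noted above might be to skip whitespace-sensitive elements while collapsing. The following is only a minimal sketch: `collapseWhitespace()` is a hypothetical helper, the list of protected tags (`pre`, `textarea`, `script`, `style`) is an assumption, and it is still regex-based, so unusual markup (for example a literal `</script>` inside a JavaScript string) can defeat it.

```php
<?php
// Hypothetical helper: collapse whitespace runs to a single space, but leave
// <pre>, <textarea>, <script> and <style> blocks byte-for-byte intact.
function collapseWhitespace(string $html): string
{
    // First alternative: a whole protected element (captured as group 1).
    // Second alternative: any run of whitespace outside such an element.
    $pattern = '#(<(pre|textarea|script|style)\b[^>]*>.*?</\2\s*>)|\s+#is';

    $result = preg_replace_callback(
        $pattern,
        // Group 1 is only present when a protected block matched;
        // return it unchanged, otherwise replace the whitespace run.
        static fn(array $m): string => (isset($m[1]) && $m[1] !== '') ? $m[1] : ' ',
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}
```

Because `<script>` blocks are left untouched, their line breaks survive and a `//` comment cannot swallow the code after it. Minifying the JavaScript itself would need a separate, JavaScript-aware tool.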
Using DOMDocument:
$doc = new DOMDocument('1.0', 'UTF-8'); # First argument is the XML version, not the HTML version
$doc->loadHTML($oldHtml);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//comment()') as $comment) {
# Also strips conditional comments for IE... uncool.
$comment->parentNode->removeChild($comment);
}
$newHtml = '<!DOCTYPE html>'; # Do I really need to do this manually?
$newHtml .= $doc->saveHTML($xpath->query('//html')->item(0));
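
A possible variation on the DOMDocument attempt above is to inspect each comment's text and skip the ones that look like IE conditional comments, then serialize the whole document rather than only the `<html>` node. This is a sketch under assumptions, not a definitive answer: `stripPlainComments()` is a hypothetical helper, the pattern used to recognise conditional comments is my own guess at the common forms (`[if ...]` and `<![endif]`), and it assumes the input already carries a doctype, which `saveHTML()` called without arguments should then write back out.

```php
<?php
// Hypothetical helper: remove ordinary HTML comments, keep IE conditional ones.
function stripPlainComments(string $html): string
{
    $doc = new DOMDocument();

    // libxml reports HTML5 elements it does not know as warnings; collect
    // them internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Take a snapshot of the comment nodes before modifying the tree.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        // Assumed shapes of conditional comments: "[if IE]>..." and "<![endif]".
        if (preg_match('/^\s*(?:\[if\s|<!\[endif\])/i', $comment->nodeValue)) {
            continue;
        }
        $comment->parentNode->removeChild($comment);
    }

    // Serializing the whole document keeps the doctype node parsed from the input.
    return $doc->saveHTML();
}
```

Note that this does nothing about whitespace or JavaScript comments, and `loadHTML()` may misread non-ASCII input unless the document declares its character set.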