I have imported the content from my blog account into a WordPress blog.
I had to apply some XPath and regex to strip out some nasty formatting.
global $post;
$html = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Remove <br> elements that appear before any text in the document.
foreach ($xpath->query('//br[not(preceding::text())]') as $node) {
    $node->parentNode->removeChild($node);
}
// Remove anchors that have no text content.
$nodes = $xpath->query('//a[string-length(.) = 0]');
foreach ($nodes as $node) {
    $node->parentNode->removeChild($node);
}
// Remove elements with no child nodes at all, except <br>.
$nodes = $xpath->query('//*[not(text() or node() or self::br)]');
foreach ($nodes as $node) {
    $node->parentNode->removeChild($node);
}
remove_filter('the_content', 'wpautop');
$content = $doc->saveHTML();
$content = ltrim($content, '<br>');
$content = strip_tags($content, '<br> <a> <iframe>');
// Collapse any run of <br> tags into exactly two.
$content = preg_replace(array('/(<br\s*\/?>\s*){1,}/'), array('<br/><br/>'), $content);
$content = str_replace(' ', ' ', $content);
// Wrap blocks separated by blank lines in <p> tags.
$content = "<p>".implode("</p>\n\n<p>", preg_split('/\n(?:\s*\n)+/', $content))."</p>";
return $content;
For some reason, though, a stray DOCTYPE is being printed in my page, and I can't work out why.
<p>!DOCTYPE html PUBLIC “-//W3C//DTD HTML 4.0 Transitional//EN” “http://www.w3.org/TR/REC-html40/loose.dtd”>
<br/>
<br/>When the battle is on between contestants in a talent show, it gets really competitive when down to the last four. X-FactorUSAcontestant Marcus Canty knows this all too well as this is the stage he was voted off of the show earlier this year.
<br/>
<br/>
</p>
Can anyone point me in the right direction as to why this is happening?
Answer 0 (score: 4)
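When you load a fragment of HTML with DOMDocument, a doctype and the html, head and body tags are added automatically if they are missing (and unclosed tags are closed), so that the fragment becomes a "valid" HTML document. So when you call saveHTML, you save all of that as well. If I remember correctly, you can find some tricks for avoiding this in the PHP manual (in the user comments).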
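One way to avoid the wrapper, as a minimal sketch rather than a definitive fix: serialize only the children of the body element instead of the whole document. The helper name `inner_body_html` is hypothetical and not from the question. The sketch assumes the caller has already handled character encoding (as the question does with mb_convert_encoding), and it relies on saveHTML accepting a node argument, which is available from PHP 5.3.6.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original post).
// Parses an HTML fragment and returns only what ended up inside <body>,
// so the doctype and the <html>/<body> wrappers added by the parser are dropped.
function inner_body_html(string $fragment): string
{
    // Nothing to parse; loadHTML rejects an empty string on newer PHP versions.
    if (trim($fragment) === '') {
        return '';
    }

    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of silencing with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> separately and join the pieces.
    $html = '';
    foreach ($body->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    return $html;
}
```

In the question's code, this would stand in for the plain `$doc->saveHTML()` call after the XPath clean-up, with the loop run against the existing `$doc`. On PHP 5.4 or later with a recent enough libxml, passing `LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD` as the second argument to loadHTML is another option. Note also that the second argument of ltrim is a list of characters, not a string prefix, so `ltrim($content, '<br>')` strips the leading `<` of `<!DOCTYPE`. That would explain why strip_tags apparently no longer treats the doctype as a tag and it shows up as text.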