I want to rip the body content out of the full HTML code of a page.
I use the script below.
<?php
function getbody($filename) {
$file = file_get_contents($filename);
$bodystartpattern = ".*<body>";
$bodyendpattern = "</body>.*";
$noheader = eregi_replace($bodystartpattern, "", $file);
$noheader = eregi_replace($bodyendpattern, "", $noheader);
return $noheader;
}
$bodycontent = getbody($_GET['url']);
?>
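But in some cases the tag is not a plain <body>; it may be <body style="margin:0;"> or something else. Can anyone tell me how to find the body tag in that case, using a regular expression in $bodystartpattern that matches up to the closing ">" of the opening body tag?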
Answer 0 (score: 3)
@1nflktd I tried the code below.
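<?php
header('Content-Type: text/html; charset=UTF-8');

// Return the inner HTML of the <body> element of the page at $filename.
function getbody($filename) {
    $file = file_get_contents($filename);
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid; keep parser warnings quiet
    $dom->loadHTML($file);
    libxml_use_internal_errors(false);
    $bodies = $dom->getElementsByTagName('body');
    assert($bodies->length === 1);
    $body = $bodies->item(0);
    // DOMElement exposes childNodes, not children; serialize each child
    // node to get the body's content without the <body> tag itself.
    $stringbody = '';
    foreach ($body->childNodes as $child) {
        $stringbody .= $dom->saveHTML($child);
    }
    return $stringbody;
}

$url = "http://www.barcelona.com/";
$bodycontent = getbody($url);
?>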
<html>
<head></head>
<body>
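<?php
// Escape both values so the extracted markup is displayed as text
// and a stray </textarea> in the source cannot break the page.
echo "BODY ripped from: ".htmlspecialchars($url)."<br/>";
echo "<textarea rows='40' cols='200'>".htmlspecialchars($bodycontent)."</textarea>";
?>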
</body>
</html>
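To check the parser approach against the exact case from the question (an opening tag with attributes) without fetching a remote page, a small sketch could look like this. The helper name `bodyFromString` and the sample markup are made up for illustration; only standard DOM extension calls are used.

```php
<?php
// Hypothetical helper: same DOM approach, but it takes the markup as a string.
function bodyFromString(string $html): string {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors(false);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';                      // no body element found
    }

    // Serialize every child node of <body>; the <body> tag itself is left out.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

// Sample input (assumption): a body tag that carries a style attribute.
$html = '<html><head><title>t</title></head>'
      . '<body style="margin:0;"><p>Hello</p></body></html>';
echo bodyFromString($html), "\n";
```

The attributes on the body tag make no difference here, because the parser locates the element by name and never matches the tag as text.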
Answer 1 (score: 2)
Why not use an HTML parser?
function getbody($filename) {
    $file = file_get_contents($filename);
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // suppress warnings from invalid markup
    $dom->loadHTML($file);
    libxml_use_internal_errors(false);
    $bodies = $dom->getElementsByTagName('body');
    assert($bodies->length === 1);
    $body = $bodies->item(0);
    // Concatenate the serialized child nodes to get the body's inner HTML
    // (DOMElement exposes childNodes, not children).
    $stringbody = '';
    foreach ($body->childNodes as $child) {
        $stringbody .= $dom->saveHTML($child);
    }
    return $stringbody;
}
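If a regular expression is still preferred, as the question literally asks, a minimal sketch might look like the following. Note that `eregi_replace` was removed in PHP 7.0, so this uses the PCRE functions instead. The helper name `bodyByRegex` and the sample markup are hypothetical, and the pattern assumes simple pages with a single body element.

```php
<?php
// Hypothetical helper: regex-only extraction for simple, well-formed pages.
function bodyByRegex(string $html): ?string {
    // <body\b[^>]*>   opening tag with any attributes, up to its closing ">"
    // (.*?)           everything up to the first closing body tag
    // i = case-insensitive, s = "." also matches newlines
    if (preg_match('~<body\b[^>]*>(.*?)</body\s*>~is', $html, $m)) {
        return $m[1];
    }
    return null; // no body tag found
}

// Sample input (assumption): upper-case tag with a style attribute.
$html = "<html><head></head>\n<BODY style=\"margin:0;\">\n<p>Hi</p>\n</BODY></html>";
var_dump(bodyByRegex($html));
```

The `[^>]*` part is what answers the question: it consumes whatever attributes follow `<body` until the first `>`. It will stop too early if an attribute value itself contains `>`, which is one reason the parser-based version is the more robust choice.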