Question

I want to extract the HTML body from the full HTML code of a page.

I am using the script below.

<?php       
    function getbody($filename) {
      $file = file_get_contents($filename);

      $bodystartpattern = ".*<body>";
      $bodyendpattern = "</body>.*";

      $noheader = eregi_replace($bodystartpattern, "", $file);

      $noheader = eregi_replace($bodyendpattern, "", $noheader);

      return $noheader;
    }
    $bodycontent = getbody($_GET['url']);
?>

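In some cases, however, the tag is not literally <body>; it may be <body style="margin:0;"> or something else. Can anyone tell me how to find the body tag in that case, using a regular expression in $bodystartpattern that matches up to the closing ">" of the opening body tag?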
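A minimal sketch of the regex approach asked about here, using `preg_match` because the `ereg*` functions used above were removed in PHP 7. The function name `getBodyByRegex` and the sample markup are illustrative, not from the original post. It assumes a single body element and no ">" character inside attribute values, so it is only a rough fit for well-behaved pages.

```php
<?php
// Hypothetical helper (name is illustrative, not from the original post).
function getBodyByRegex(string $html): string
{
    // <body\b[^>]*> matches "<body>" and "<body style=...>" alike:
    // [^>]* consumes any attributes up to the closing ">" of the opening tag.
    // Flags: i = case-insensitive, s = "." also matches newlines.
    if (preg_match('/<body\b[^>]*>(.*?)<\/body\s*>/is', $html, $m) === 1) {
        return $m[1];
    }
    return ''; // no body element found
}

$sample = "<html><head><title>t</title></head>\n"
        . "<body style=\"margin:0;\"><p>Hello</p></body></html>";
echo getBodyByRegex($sample);
```

The capture group returns only what sits between the opening and closing body tags, which is what the two `eregi_replace` calls were trying to achieve.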

Answer 1

@1nflktd I tried the code below.

<?php
    header('Content-Type:text/html; charset=UTF-8');

    function getbody($filename) {
        $file = file_get_contents($filename);       
        $dom = new DOMDocument;
        $dom->loadHTML($file);
        $bodies = $dom->getElementsByTagName('body');
        assert($bodies->length === 1);
        $body = $bodies->item(0);
        // Serialize each child node of <body> to get its inner HTML.
        $stringbody = '';
        foreach ($body->childNodes as $child) {
            $stringbody .= $dom->saveHTML($child);
        }
        return $stringbody;
    }

    $url = "http://www.barcelona.com/";
    $bodycontent = getbody($url);
?>

<html>
<head></head>
<body>
<?php
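    echo "BODY ripped from: ".htmlspecialchars($url)."<br/>";
    // Escape the markup so it is shown as text inside the textarea.
    echo "<textarea rows='40' cols='200' >".htmlspecialchars($bodycontent)."</textarea>";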
?>
</body>
</html>

Answer 2

Why not use an HTML parser?

function getbody($filename) {
  $file = file_get_contents($filename);

  $dom = new DOMDocument();
  libxml_use_internal_errors(true);
  $dom->loadHTML($file);
  libxml_use_internal_errors(false);
  $bodies = $dom->getElementsByTagName('body');
  assert($bodies->length === 1);
  $body = $bodies->item(0);
  // Serialize each child node of <body> to get its inner HTML.
  $stringbody = '';
  foreach ($body->childNodes as $child) {
      $stringbody .= $dom->saveHTML($child);
  }
  return $stringbody;
}

DOM loadHTML reference
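
For experimenting without a network request, here is a self-contained sketch of the parser approach that takes an HTML string instead of a URL and returns only the markup inside the body element. The helper name `bodyInnerHtml` and the sample input are assumptions made for illustration. It relies on PHP's bundled DOM extension.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <body> element.
function bodyInnerHtml(string $html): string
{
    if ($html === '') {
        return ''; // loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Collect parse warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child); // serialize each child node in turn
    }
    return $inner;
}

echo bodyInnerHtml('<html><body style="margin:0;"><p>Hello</p></body></html>');
```

Because the parser locates the element itself, attributes on the body tag need no special handling.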

PHP: get the body from an HTML page
