Question

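I'm trying to do autoblogging with WordPress (that is, RSS-driven blog posting), and the only missing piece is a component that automatically fills the post content with whatever the RSS item's URL links to (RSS itself is irrelevant to the solution).

Using standard PHP 5, how can I create a function called fetchHTML([URL]) that returns the HTML content of a web page found between the <body>...</body> tags?

Please let me know if there are any prerequisite "includes". Thanks.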

Answer 1

OK, here is the requested DOM parser code example.

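<?php

function fetchHTML( $url )
  {
  // Download the page; file_get_contents() returns false on failure.
  $content = file_get_contents($url);
  if ($content === false || $content === '') { return ''; }

  $html = new DOMDocument();
  // The document must be loaded before it can be queried.
  // @ suppresses the warnings that malformed real-world markup triggers.
  @$html->loadHTML($content);

  // item(0) fetches the first <body> directly, so no loop-and-break is needed.
  $body = $html->getElementsByTagName('body')->item(0);
  if ($body === null) { return ''; }

  // textContent would drop the tags; serialize each child to keep the markup.
  // Note: saveXML() emits XML-style output (e.g. <br/>).
  $inner = '';
  foreach ($body->childNodes as $child) { $inner .= $html->saveXML($child); }
  return $inner;
  }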

Answer 2

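Assuming it is always <body> and never <BODY> or <body style="width:100%"> or anything other than <body> and </body>, and with the caveat that you shouldn't use regular expressions to parse HTML, even though I'm about to, here you go: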

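<?php

function fetchHTML( $url )
{
    // Download the page; file_get_contents() returns false on failure.
    $content = file_get_contents( $url );
    if ( $content === false ) {
        return '';
    }

    // [\s\S]+ matches any character, newlines included, up to the last </body>.
    // Return an empty string when no <body>...</body> pair is found.
    if ( ! preg_match( '/<body>([\s\S]+)<\/body>/', $content, $match ) ) {
        return '';
    }

    return $match[1];
} // fetchHTML
?>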

If you echo fetchHTML([some url]);, you will get the HTML between the body tags.

Please note the original caveats.
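The first caveat (uppercase tags and attributes on the body tag) can be relaxed somewhat. The following is only a sketch, not part of the original answer: extractBody is a hypothetical helper name, and it takes an HTML string rather than a URL so it can be tried without network access. It is still a regular expression, so it remains fragile, for example when an attribute value contains a > character.

```php
<?php

// Hypothetical helper (not from the original answer): returns the markup
// between the first <body ...> tag and the last </body> tag of an HTML string.
function extractBody( $html )
{
    // <body\b[^>]*> tolerates attributes on the opening tag.
    // The i modifier makes the match case-insensitive (<BODY> works too).
    // The s modifier lets . match newlines, so multi-line bodies are captured.
    if ( preg_match( '/<body\b[^>]*>(.*)<\/body\s*>/is', $html, $match ) ) {
        return $match[1];
    }

    // No body element found.
    return '';
}

// Example with an uppercase tag and an attribute.
echo extractBody( '<html><BODY style="width:100%"><p>Hi</p></BODY></html>' );
```

To use it with a URL, pass it the string returned by file_get_contents( $url ) after checking that the download did not return false.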

Answer 3

I think you'd be better off using a class like SimpleDom -> http://sourceforge.net/projects/simplehtmldom/ to extract the data, since you won't need to write such complicated regular expressions.

Simple PHP screen scraping function
