Question

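I have a web page, for example http://example.com/some-page. If I pass this URL to my PHP function, it should fetch the page's title and content. I tried to grab the title like this: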

function page_title($url) {
    $page = @file_get_contents($url);
    if (preg_match('~<h1 class="page-title">(.*)<\/h1>~is', $page, $matches)) {
        return $matches[0];
    }
}

echo page_title('http://example.com/some-page');

What is my mistake?

Answer 1

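Your function almost works. I will propose a DOM parser solution (see below), but first I will point out some weaknesses in the regular expression and the code: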

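The (.*) capture group is greedy: it captures the longest possible string before a closing </h1>, even across line breaks (because of the s modifier). So if your document has several h1 tags, it captures everything up to the last one! You can fix this by making the capture lazy: (.*?)
The actual page may have other tags inside the title, such as a span. You could refine the regular expression to exclude any tags around the title, but PHP has the strip_tags function for this purpose.
Make sure the file contents are actually retrieved; an error may have prevented the retrieval, or your server may not allow it. Because you use the @ prefix to suppress errors, you may be missing them. I suggest removing the @. You can also check whether the return value is false.
Are you sure you want the contents of the H1 tag? Pages usually have a dedicated title tag.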
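The greedy versus lazy behaviour described in the first two points can be checked on a small string. This is only a sketch: the sample HTML and the variable names are made up for illustration and do not come from the question.

```php
<?php
// Hypothetical sample document with two h1 elements.
$html = '<h1 class="page-title">First <span>title</span></h1>'
      . '<p>Some text</p>'
      . '<h1>Second heading</h1>';

// Greedy: (.*) runs on until the LAST </h1> in the document.
preg_match('~<h1 class="page-title">(.*)</h1>~is', $html, $greedy);

// Lazy: (.*?) stops at the FIRST </h1>.
preg_match('~<h1 class="page-title">(.*?)</h1>~is', $html, $lazy);

$greedyText = $greedy[1]; // also swallows the paragraph and the second heading
$lazyText   = $lazy[1];   // only the inner HTML of the first h1
$cleanTitle = strip_tags($lazyText); // removes the nested span tags

echo $cleanTitle, PHP_EOL;
```

Index 1 of the match array holds the first capture group, while index 0 holds the whole match including the h1 tags themselves.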

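With the above improvements you get this code:

function page_title($url) {
    $page = file_get_contents($url);
    if ($page === false) {
        echo "Failed to retrieve $url";
    }
    if (preg_match('~<h1 class="page-title">(.*?)<\/h1>~is', $page, $matches)) {
        // $matches[0] is the whole match; strip_tags removes the surrounding h1 tags
        return strip_tags($matches[0]);
    }
}

While this works, sooner or later you will run into a document with extra whitespace in the h1 tag, or another attribute before class, or more than one CSS class, and so on, making the match fail. The following regular expression handles some of these cases:

~<h1\s+class\s*=\s*"([^" ]* )?page-title( [^"]*)?"[^>]*>(.*?)<\/h1\s*>~is

...but the class attribute must still come before any other attribute, and its value must be enclosed in double quotes. This can be solved too, but the regular expression will turn into a monster.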
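To see which variations the more tolerant pattern accepts, it can be run against a few hand-written h1 tags. The sample strings and labels below are assumptions chosen for illustration. Note that the title now sits in the third capture group.

```php
<?php
// The tolerant pattern from the answer; the title is capture group 3.
$pattern = '~<h1\s+class\s*=\s*"([^" ]* )?page-title( [^"]*)?"[^>]*>(.*?)<\/h1\s*>~is';

// Hypothetical variations of the same heading.
$samples = [
    'extra class'     => '<h1 class="big page-title">A</h1>',
    'extra spaces'    => '<h1  class = "page-title" id="t">B</h1 >',
    'attribute first' => '<h1 id="t" class="page-title">C</h1>',
    'single quotes'   => "<h1 class='page-title'>D</h1>",
];

$titles = [];
foreach ($samples as $label => $html) {
    // Store the stripped title on a match, null otherwise.
    $titles[$label] = preg_match($pattern, $html, $m) ? strip_tags($m[3]) : null;
}

var_dump($titles);
```

The first two samples match, while the last two show the remaining limitations: class is not the first attribute, or its value is not in double quotes.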

The DOM way

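Regular expressions are not the ideal way to extract content from HTML. Here is an alternative function based on DOM parsing:

function xpage_title($url) {
    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();
    // Load the url's contents into the DOM, ignore warnings
    libxml_use_internal_errors(true);
    $success = $xml->loadHTMLFile($url);
    libxml_use_internal_errors(false);
    if (!$success) {
        echo "Failed to open $url.";
        return;
    }
    // Find first h1 with class 'page-title' and return its text contents
    foreach ($xml->getElementsByTagName('h1') as $h1) {
        // Does it have the desired class?
        if (in_array('page-title', explode(" ", $h1->getAttribute('class')))) {
            return $h1->textContent;
        }
    }
}

The above could be improved by using DOMXPath.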
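One possible shape of the DOMXPath variant is sketched below. It is not the answer's own code: the function name xpath_page_title is hypothetical, and it takes an HTML string instead of a URL so that it can be tried without a network request.

```php
<?php
// Hypothetical helper: returns the text of the first h1 whose class list
// contains "page-title", or null if there is none.
function xpath_page_title(string $html): ?string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    // Pad the class attribute with spaces so that only whole class names match.
    $query = '//h1[contains(concat(" ", normalize-space(@class), " "), " page-title ")]';
    $nodes = $xpath->query($query);

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$sample = '<html><body><h1 id="t" class="big page-title">Hello <span>world</span></h1>'
        . '<h1>Other</h1></body></html>';
echo xpath_page_title($sample), PHP_EOL;
```

The XPath expression replaces the explicit loop and the explode/in_array check, and it does not depend on attribute order or quoting style.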

Edit

You mentioned in the comments that you do not actually want the contents of the H1 tag, because it contains more text than you want.

In that case you can read the contents of the title tag and the article tag:

function page_title_and_content($url) {
    $page = file_get_contents($url);
    if ($page === false) {
        echo "Failed to retrieve $url";
    }
    // PHP 5.4: $result = (object) ["title" => null, "content" => null];
    $result = new stdClass();
    $result->title = null;
    $result->content = null;
    if (preg_match('~\<title\>(.*?)\<\/title\>~is', $page, $matches)) {
        $result->title = $matches[1];
    }
    if (preg_match('~<article>(.*)<\/article>~is', $page, $matches)) {
        $result->content = $matches[1];
    }
    return $result;
}

$result = page_title_and_content('http://www.example.com/example');
echo "title: " . $result->title . "<br>";
echo "content: <br>" . $result->content . "<br>";

The code above returns an object with two properties: title and content. Note that the content property will contain HTML tags, possibly including images and the like. If you do not want the tags, apply strip_tags.
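As a rough illustration of that last step, the snippet below turns an HTML fragment into plain text. The sample content is invented. Decoding entities and collapsing whitespace are extra steps assumed here; the answer itself only mentions strip_tags.

```php
<?php
// Hypothetical value of $result->content as returned by the function above.
$content = "<p>Hello <b>world</b></p>\n"
         . "<img src=\"a.png\" alt=\"pic\">\n"
         . "<p>Bye &amp; thanks</p>";

// strip_tags removes the markup but leaves entities such as &amp; untouched.
$text = strip_tags($content);

// Assumed extra clean-up: decode entities and collapse runs of whitespace.
$text  = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$plain = trim(preg_replace('~\s+~', ' ', $text));

echo $plain, PHP_EOL;
```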
