Question

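Given a URL, how would you parse the source code of a web page? I want to find the author, the title, and the last-modified time from the source.

My idea is to fetch the source with file_get_contents() and parse it. For the author, I would look for <meta name="author" content="[...]"> in the source and extract the value of content. For the title, I would look for <title>[...]</title> and extract what is inside. I'm not sure what I would do for the last-modified time.

Would these approaches work? Is there a better way?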

Answer 1

You can use file_get_contents.

For example:

$content = file_get_contents('http://www.external-site.com/page.php');

The variable $content will then hold the content of the external site.
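For the last-modified time, one option is to look at the HTTP response headers rather than the page body. The sketch below is only an illustration: findLastModified() is a hypothetical helper (not a PHP built-in), and the sample header lines are made up so the snippet runs without a network request.

```php
<?php
// Hypothetical helper: scan raw header lines for a Last-Modified value.
function findLastModified(array $headerLines): ?string
{
    foreach ($headerLines as $line) {
        // Split on the first colon only, because the date itself contains colons.
        $parts = explode(':', $line, 2);
        // Header names are case-insensitive.
        if (count($parts) === 2 && strcasecmp(trim($parts[0]), 'Last-Modified') === 0) {
            return trim($parts[1]);
        }
    }
    return null; // The server did not send the header.
}

// Made-up header lines, in the one-string-per-header shape PHP's HTTP wrapper produces.
$sampleHeaders = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=UTF-8',
    'Last-Modified: Wed, 21 Oct 2015 07:28:00 GMT',
];
$lastModified = findLastModified($sampleHeaders);
```

After a real file_get_contents() call over HTTP, PHP fills the local variable $http_response_header with lines of this shape, so it could be passed to the helper in place of $sampleHeaders. Many dynamically generated pages do not send Last-Modified at all, so the null case needs handling.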

Answer 2

You need to parse the DOM.

Try a parser like this one: http://simplehtmldom.sourceforge.net/

Answer 3

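Use curl (it still works when the "allow_url_fopen" directive is false, and it is more flexible).

To parse the page source, use the DOM library, but disable libxml error output before loading the HTML content.

Example: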

<?php
$url = 'http://stackoverflow.com/';

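// Fetch the page; RETURNTRANSFER makes curl_exec() return the body as a string.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
// curl_exec() returns false (not null) on failure.
if( $content === false || $httpCode >= 400 ) {
    die();
}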

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($content);

// Use the first <title> element; a page can contain more than one (e.g. inside inline SVG).
$title = null;
$titleNodes = $dom->getElementsByTagName('title');
if( $titleNodes->length > 0 ) {
    $title = $titleNodes->item(0)->textContent;
}
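
The author can be read in much the same way, by walking the meta elements and picking the one whose name attribute is "author". This is a minimal sketch under the assumption that the page declares its author that way; the HTML string is made-up sample markup standing in for the downloaded $content.

```php
<?php
// Made-up sample markup standing in for the downloaded page.
$html = '<html><head><title>Example page</title>'
      . '<meta name="author" content="Jane Doe"></head><body></body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$author = null;
foreach ($dom->getElementsByTagName('meta') as $meta) {
    // Compare case-insensitively so name="Author" also matches.
    if (strcasecmp($meta->getAttribute('name'), 'author') === 0) {
        $author = $meta->getAttribute('content');
        break;
    }
}
```

If no such meta element exists, $author stays null, so the caller can tell "not declared" apart from an empty value.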
