Question

Possible duplicate:
Parse Website for URLs

How can I get all the links on a web page using PHP?

I need to get a list of links: -

Google

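I want to get the href (http://www.google.com) and the text (Google).

-------------------The situation is: -

I am building a crawler, and I want it to get all the links that exist in a database table.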

Answer 1

There are a few ways to do this, but the way I would approach it is as follows.

Use cURL to fetch the page, i.e.:

// $target_url holds the URL to be fetched, e.g. "http://www.website.com"
// $userAgent should be set to a friendly agent; sneaky, but hey...

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// Create the handle first: options can only be set on an initialized handle.
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP status codes >= 400 as failures
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_AUTOREFERER, true);    // set the Referer header when following redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // give up after 10 seconds
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

If all goes well, the page content is now in $html.

Let's move on and load the page into a DOM object:

$dom = new DOMDocument();
@$dom->loadHTML($html);
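
The @ operator above simply hides the warnings that loadHTML() raises on sloppy markup. As an alternative sketch (not part of the original answer), libxml's internal error buffer can collect those messages instead. The sample markup below is made up and stands in for the fetched $html.

```php
<?php
// Hypothetical sample markup standing in for the page fetched with cURL.
$html = '<html><body><p>Some text<a href="http://www.google.com">Google</a></body></html>';

// Buffer libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Fetch whatever the parser complained about, then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);

echo $loaded ? "parsed" : "failed", "\n";
echo count($errors), " parser message(s)\n";
```

Whether any messages are reported depends on the markup and on the libxml version, so the count is only informational here.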

So far so good. XPath to the rescue, to pull the links out of the DOM object:

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
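
The expression /html/body//a matches every a element in the body, including named anchors that have no href at all. A slightly narrower query, sketched here on made-up markup, selects only the anchors that actually carry an href attribute. The predicate [@href] is standard XPath; the sample HTML is an assumption for illustration.

```php
<?php
// Hypothetical sample markup: one named anchor without href, one real link.
$html = '<html><body>
  <a name="top">Named anchor, no href</a>
  <a href="http://www.google.com">Google</a>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Only <a> elements that have an href attribute are returned.
$hrefs = $xpath->query('//a[@href]');

echo $hrefs->length, "\n"; // prints 1
```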

Loop through the results and get the links:

$links = array();
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $link = $href->getAttribute('href');
    $text = $href->nodeValue;

    // Do what you want with the link, print it out:
    echo $text , ' -> ' , $link;

    // Or save this in an array for later processing..
    $links[$i]['href'] = $link;
    $links[$i]['text'] = $text;
}

$hrefs is an object of type DOMNodeList, and item() returns the DOMNode object at the specified index. So basically we have a loop that retrieves each link as a DOMNode object.
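
Since DOMNodeList is traversable, the same loop can also be written with foreach, which avoids the index bookkeeping. This is only a sketch on made-up markup; the instanceof check is a precaution, because getAttribute() exists on DOMElement rather than on every DOMNode.

```php
<?php
// Hypothetical sample markup with two links.
$html = '<html><body>'
      . '<a href="http://www.google.com">Google</a> '
      . '<a href="http://www.example.com">Example</a>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = array();
foreach ($xpath->query('/html/body//a') as $node) {
    // Skip anything that is not an element node.
    if (!$node instanceof DOMElement) {
        continue;
    }
    $links[] = array(
        'href' => $node->getAttribute('href'),
        'text' => trim($node->textContent),
    );
}

print_r($links);
```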

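This should pretty much do it for you. The only part I am not 100% sure about is what happens when the link is an image or an anchor. I don't know how those cases behave, so you will need to test and filter them out.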
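
As a rough sketch of that kind of filtering, under the assumption that "anchor" means an in-page link such as #section: a link that wraps only an image has empty text, so the example below falls back to the image's alt attribute, and it skips hrefs that are empty or start with #. The sample markup and the fallback policy are illustrative choices, not something the original answer specifies.

```php
<?php
// Hypothetical sample markup: a normal link, an in-page anchor, and an image link.
$html = '<html><body>
  <a href="http://www.google.com">Google</a>
  <a href="#section">Jump</a>
  <a href="http://www.example.com/pic"><img src="pic.png" alt="A picture"></a>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = array();
foreach ($xpath->query('//a[@href]') as $a) {
    $href = $a->getAttribute('href');

    // Skip empty hrefs and in-page anchors such as "#section".
    if ($href === '' || strpos($href, '#') === 0) {
        continue;
    }

    // An image-only link has no text of its own; fall back to the alt text.
    $text = trim($a->textContent);
    if ($text === '') {
        $img = $a->getElementsByTagName('img')->item(0);
        $text = ($img !== null) ? $img->getAttribute('alt') : '';
    }

    $links[] = array('href' => $href, 'text' => $text);
}

print_r($links);
```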

Hope this gives you an idea of how to scrape links. Happy coding.
