PHP cURL: parsing the links on a web page

Date: 2013-04-13 20:14:36

Tags: php curl

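This is the script I'm using. It stores every link found on a given web page, but I'd like to know how to make it store only the links from one part of the page. The part of the page I want to scrape is a blockquote whose id is "toc_rows".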

Here is my PHP code:

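function storeLink($url, $gathered_from) {
    // escape both values so a quote inside a URL cannot break (or inject into) the query
    $url = mysql_real_escape_string($url);
    $gathered_from = mysql_real_escape_string($gathered_from);
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
}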
$target_url = "http://milwaukee.craigslist.org/cpg/";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if ($html === false) { // curl_exec() returns false on failure; an empty body is not an error
    echo "<br />cURL error number:" .curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <a> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url,$target_url);
    echo "<br />Link stored: $url";
}
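
One possible way to narrow this down, as a minimal sketch rather than a confirmed answer: make the XPath expression start at the blockquote with id "toc_rows" instead of at /html/body, so only the anchors nested inside it are matched. The helper name linksInsideBlockquote and the sample markup below are made up for illustration and are not part of the script above; the real page's structure is assumed to match the description in the question.

```php
<?php
// Hypothetical helper (not in the original script): returns the href of every
// <a> element nested anywhere inside <blockquote id="$id">.
function linksInsideBlockquote(string $html, string $id): array {
    // Collect libxml parse warnings internally instead of printing them,
    // since real-world HTML is rarely well-formed.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Assumption: $id contains no double quote, so it can be embedded directly.
    $nodes = $xpath->query('//blockquote[@id="' . $id . '"]//a[@href]');

    $urls = [];
    foreach ($nodes as $node) {
        $urls[] = $node->getAttribute('href');
    }
    return $urls;
}

// Made-up sample markup standing in for the page fetched with cURL.
$sample = '<html><body>'
        . '<a href="/outside">outside the blockquote</a>'
        . '<blockquote id="toc_rows">'
        . '<p><a href="/cpg/1.html">one</a></p>'
        . '<p><a href="/cpg/2.html">two</a></p>'
        . '</blockquote>'
        . '</body></html>';

$links = linksInsideBlockquote($sample, 'toc_rows');
print_r($links);
```

For the sample markup this yields only the two links inside the blockquote and skips the one outside it. In the script above, the same effect could probably be had by changing the expression passed to $xpath->evaluate() and leaving the storeLink() loop as it is.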

0 answers:

No answers