我正在尝试制作一个脚本,用于加载所需的网址(由用户输入),并检查该网页是否在我的网站上发布域名之前链接回我的域名。我对正则表达式不是很熟悉,这是我到目前为止所做的:
$loaded = file_get_contents('http://localhost/small_script/page.php');
// $loaded will be equal to the users site they have submitted
$current_site = 'site2.com';
// $current_site is the domain of my site, this the the URL that must be found in target site
$matches = Array();
$find = preg_match_all('/<a(.*?)href=[\'"](.*?)[\'"](.*?)\b[^>]*>(.*?)<\/a>/i', $loaded, $matches);
$c = count($matches[0]);
$z = 0;
while($z<$c){
$full_link = $matches[0][$z];
$href = $matches[2][$z];
$z++;
$check = strpos($href,$current_site);
if($check === false) {
}else{
// The link cannot have the "no follow" tag, this is to check if it does and if so, return a specific error
$pos = strpos($full_link,'no follow');
if($pos === false) {
echo $href;
}
else {
//echo "rel=no follow FOUND";
}
}
}
正如你所看到的,它非常混乱,我完全确定它的发展方向。我希望有人可以给我一个小巧,快速和简洁的脚本,这样做会完全按照我的尝试。
答案 0 :(得分:0)
这是代码:) 由http://www.merchantos.com/makebeta/php/scraping-links-with-php/
帮助<?php
$my_url = 'http://online.bulsam.net';
$target_url = 'http://www.bulsam.net';
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
// find result
$result = is_my_link_there($hrefs, $my_url);
if ($result == 1) {
echo 'There is no link!!!';
} elseif ($result == 2) {
echo 'There is, but it is NO FOLLOW !!!';
} else {
// blah blah blah
}
// used functions
function is_my_link_there($hrefs, $my_url) {
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
if ($my_url == $url) {
$rel = $href->getAttribute('rel');
if ($rel == 'nofollow') {
return 2;
}
return 3;
}
}
return 1;
}