Finding nested links with Simple HTML DOM (recursion)

Date: 2017-06-20 10:48:42

Tags: php parsing recursion web-crawler

I'm new to programming, so here is my problem. I'm trying to build a recursive PHP spider using Simple HTML DOM Parser that crawls a website and returns a list of its pages with their 2xx, 3xx, 4xx and 5xx status codes. I've been looking for a solution, but (probably due to my inexperience) I haven't found anything that works. My current code finds all the links on the root/index page, but I want it to recursively find the links inside those previously found pages as well, and so on, down to, say, level 5. Taking the root page as level 0, the recursive part I wrote only shows the level-1 links, repeated 5 times. Any help is appreciated. Thanks. (A sketch of the intended level-by-level crawl follows the code below.)

<?php
  echo "<strong><h1>Sitemap</h1></strong><br>";

  include_once('simple_html_dom.php');

  $url = "http://www.gnet.it/";
  $html = new simple_html_dom();
  $html->load_file($url);
  echo "<strong><h2>Int Links</h2></strong><br>";
  foreach($html->find("a") as $a)
  {
    if((!(preg_match('#^(?:https?|ftp)://.+$#', $a->href)))&&($a->href != null)&&($a->href != "javascript:;")&&($a->href != "#"))
    {
    echo "<strong>" . $a->href . "</strong><br>";
    }
  }

  echo "<strong><h2>Ext Links</h2></strong><br>";
  foreach($html->find("a") as $a)
  {
    if(((preg_match('#^(?:https?|ftp)://.+$#', $a->href)))&&($a->href != null)&&($a->href != "javascript:;")&&($a->href != "#"))
    {
    echo "<strong>" . $a->href . "</strong><br>";
    }
  }


//recursion

  // NOTE: this is where the problem comes from. $a still holds whatever
  // anchor the foreach loops above ended on, so $recurl is a single URL
  // and the page is loaded only once, outside the while loop.
  $depth = 1;
  $maxDepth = 5;
  $recurl = "$a->href";
  $rechtml = new simple_html_dom();
  $rechtml->load_file($recurl);
  // The loop never loads a new page, so it just prints the same link
  // list $maxDepth times instead of descending one level per pass.
  while($depth <= $maxDepth){
    echo "<strong><h2>Nested links, level $depth</h2></strong><br>";
    foreach($rechtml->find("a") as $a)
    {
      if($a->href != null)
      {
        echo "<strong>" . $a->href . "</strong><br>";
      }
    }
    $depth++;
  }


//csv

  echo "<strong><h1>Google Crawl Errors from CSV</h1></strong><br>";
  echo "<table>\n\n";
$f = fopen("CrawlErrors.csv", "r");
while (($line = fgetcsv($f)) !== false) {
        echo "<tr>";
        foreach ($line as $cell) {
                echo "<td>" . htmlspecialchars($cell) . "</td>";
        }
        echo "</tr>\n";
}
fclose($f);
echo "\n</table>";
?>
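
For reference, here is a minimal sketch of the two things the code above is missing: a crawl that actually descends level by level, and the 2xx/3xx/4xx/5xx status check the question asks for. It assumes the file_get_html() helper bundled with simple_html_dom.php and the cURL extension; crawl_level() and http_status() are illustrative names, not part of the original code:

<?php
include_once('simple_html_dom.php');

// Illustrative helper: fetch only the headers of a URL with cURL and
// return its HTTP status code (200, 301, 404, 500, ...).
function http_status($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD-style request, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the response
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $code;
}

// Illustrative recursive crawl: print each URL with its status code,
// then descend into its links until the depth counter runs out.
function crawl_level($url, $depth) {
    static $seen = array();                 // don't visit the same URL twice
    if ($depth < 0 || isset($seen[$url])) {
        return;
    }
    $seen[$url] = true;

    echo "<strong>" . htmlspecialchars($url) . "</strong> => " . http_status($url) . "<br>";

    $html = file_get_html($url);            // helper shipped with simple_html_dom.php
    if (!$html) {
        return;
    }
    foreach ($html->find("a") as $a) {
        // follow absolute http(s) links only; relative URLs would first
        // need to be resolved against $url
        if (preg_match('#^https?://#', $a->href)) {
            crawl_level($a->href, $depth - 1);
        }
    }
    $html->clear();                         // free the DOM tree
}

// root page is level 0, so this goes down to level 5
crawl_level("http://www.gnet.it/", 5);
?>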

1 Answer:

Answer 0 (score: 0)

Try this:

I call this routine in a basic scraper to recursively find all the links on a site. You will have to add some logic to stop it crawling external sites that are merely linked from pages on your site, or it will run forever! (A sketch of such a guard follows the code below.)

Note that I got most of this code from a previous SO thread, so the answer was already out there.

function crawl_page($url, $depth = 2){

    // strip trailing slash from URL
    if (substr($url, -1) == '/') {
        $url = substr($url, 0, -1);
    }

    // which URLs have we already crawled?
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            // build the URLs to the same standard - with http:// etc
            // (note: this treats every relative link as root-relative)
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                // http_build_url() comes from the pecl_http extension
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= $path;
            }
        }
        crawl_page($href, $depth - 1);
    }

    // pull out the actual page name without any parent dirs
    $pos = strrpos($url, '/');
    $slug = $pos === false ? "root" : substr($url, $pos + 1);

    echo "slug:" . $slug . "<br>";
}
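
As noted above, you need logic to keep the crawl on your own site. A minimal standalone sketch of such a same-host guard, using the same DOMDocument approach (crawl_site() and its structure are illustrative, not part of the original answer; it follows absolute links only, for brevity):

// Same-host variant: remember the host of the start page and refuse
// to descend into any URL on a different host.
function crawl_site($url, $depth, $host = null) {
    static $seen = array();
    if ($host === null) {
        $host = parse_url($url, PHP_URL_HOST);  // resolved once, at the start
    }
    if ($depth === 0 || isset($seen[$url])) {
        return;
    }
    if (parse_url($url, PHP_URL_HOST) !== $host) {
        return;                                 // off-site link: don't follow
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);
    foreach ($dom->getElementsByTagName('a') as $element) {
        $href = $element->getAttribute('href');
        if (0 === strpos($href, 'http')) {      // absolute links only here
            crawl_site($href, $depth - 1, $host);
        }
    }
    echo "crawled: " . htmlspecialchars($url) . "<br>";
}

// typical usage: the start page plus one level of its links
crawl_site('http://www.gnet.it/', 2);

Passing $host down through the recursion keeps the check cheap: it is resolved once at the start and compared against every candidate URL before that URL is fetched.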