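First off, as you can probably tell from my code, I'm a beginner. I'm trying to build a link crawler that searches a page for links, then follows each link and adds the links it finds to an array. In the end, we should have all the links on the site.
The code looks like this: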
<?php
$to_crawl = "http://reteteculinare.ro";
$c = array();
$final = array();

function get_Links($to_crawl){
    global $c, $final;
    $input = @file_get_contents($to_crawl);
    $base_url = parse_url($to_crawl, PHP_URL_HOST);
    $regexp = '<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>';
    preg_match_all("/$regexp/siU", $input, $matches);
    $l = $matches[2];
    foreach ($l as $link) {
        if(strpos($link, "#")) {
            $link = substr($link, 0, strpos($link, "#"));
        }
        if(substr($link, 0, 1) == "."){
            $link = substr($link, 1);
        }
        if(substr($link, 0, 7) == "http://"){
            $link = $link;
        } else if (substr($link, 0, 8) == "https://"){
            $link = $link;
        } else if (substr($link, 0, 4) == "www."){
            $link = substr($link, 4);
        } else if (substr($link, 0, 6) == "//wwww."){
            $link = substr($link, 6);
        } else if (substr($link, 0, 2) == "//"){
            $link = substr($link, 2);
        } else if (substr($link, 0, 1) == "#"){
            $link = $to_crawl;
        } else if (substr($link, 0, 7) == "mailto:"){
            $link = "[".$link."]";
        } else {
            if(substr($link, 0, 1) != "/") {
                $link = $base_url."/".$link;
            } else {
                $link = $base_url.$link;
            }
        }
        if(substr($link, 0, 4) == "www."){
            $link = substr($link, 4);
        }
        if(substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && substr($link, 0, 1) != "[") {
            $link = "http://".$link;
        }
        if (!in_array($link, $c)) {
            array_push($c, $link);
        }
    }
}

get_links($to_crawl);
foreach ((array)$c as $page) {
    get_links($page);
    foreach ((array)$c as $page) {
        if (!in_array($page, $final)) {
            $final[] = $page;
            echo '<pre>';
            print_r($final);
            echo '</pre>';
        }
    }
}
?>
The problem with the current code is that it prints the array every time a new page is added, so we get something like this:
Array
(
[0] => http://reteteculinare.ro/autentificare/
)
Array
(
[0] => http://reteteculinare.ro/autentificare/
[1] => http://reteteculinare.ro/inregistrare/
)
Array
(
[0] => http://reteteculinare.ro/autentificare/
[1] => http://reteteculinare.ro/inregistrare/
[2] => http://reteteculinare.ro/
)
Array
(
[0] => http://reteteculinare.ro/autentificare/
[1] => http://reteteculinare.ro/inregistrare/
[2] => http://reteteculinare.ro/
[3] => http://reteteculinare.ro/retete/
)
Array
(
[0] => http://reteteculinare.ro/autentificare/
[1] => http://reteteculinare.ro/inregistrare/
[2] => http://reteteculinare.ro/
[3] => http://reteteculinare.ro/retete/
[4] => http://reteteculinare.ro/mixer-ingrediente/
)
Array
(
[0] => http://reteteculinare.ro/autentificare/
[1] => http://reteteculinare.ro/inregistrare/
[2] => http://reteteculinare.ro/
[3] => http://reteteculinare.ro/retete/
[4] => http://reteteculinare.ro/mixer-ingrediente/
[5] => http://reteteculinare.ro/reteta_saptamanii/
)
...
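Notes:
- I tried moving the print to the end of the second foreach, but I still get multiple arrays.
- I tried putting the print after the first foreach, but then nothing is output.
What am I doing wrong, and how can I print the $final array only once, after the whole script has finished?
Cheers!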
Answer 0 (score: 0)
Move print_r($final); outside the foreach loops, so that it runs only once, after they have finished.
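As a rough illustration of that advice, the sketch below collects every page first and prints the result a single time at the end. It is only a sketch under stated assumptions: `fetch_links()`, `crawl()`, and the in-memory `$site` link graph are made-up names standing in for the real `get_Links()` fetching and parsing, so that the example runs without touching the network. It also swaps the nested foreach for a simple queue, which is one possible way to keep following newly discovered links, not necessarily what the asker intended.

```php
<?php
// Hypothetical stand-in for get_Links(): returns the links found on a page.
// A real crawler would download and parse the page here instead.
function fetch_links(string $page, array $site): array
{
    return $site[$page] ?? [];
}

// Collect every reachable page first; nothing is printed while crawling.
function crawl(string $start, array $site): array
{
    $final = [];          // visited pages, stored as array keys for fast lookup
    $queue = [$start];    // pages still waiting to be crawled
    while ($queue) {
        $page = array_shift($queue);
        if (isset($final[$page])) {
            continue;     // already visited
        }
        $final[$page] = true;
        foreach (fetch_links($page, $site) as $link) {
            if (!isset($final[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($final);
}

// Made-up link graph, used only for this illustration.
$site = [
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/', 'http://example.com/b'],
    'http://example.com/b' => [],
];

$final = crawl('http://example.com/', $site);

// Print once, after all crawling has finished.
echo '<pre>';
print_r($final);
echo '</pre>';
```

Because the `print_r()` call sits after the loop rather than inside it, the array is written out once, no matter how many pages were added along the way.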