I'm building a very basic web crawler in PHP as a project while I wait for CS50 to start. This is what I have so far:
<?php
$start = "http://localhost/~jordanbaron/Web%20Crawler/input.html";
$already_crawled = array();
function get_details($url)
{
    global $already_crawled;
    // Send a custom User-Agent with the request ("header" is the HTTP context option name).
    $context = stream_context_create(array('http' => array('method' => "GET", 'header' => "User-Agent: jordanBot\n")));
    $doc = new DOMDocument();
    // @ suppresses warnings from failed requests and malformed HTML.
    @$doc->loadHTML(@file_get_contents($url, false, $context));
    $title = $doc->getElementsByTagName("title");
    // Fall back to an empty title when the page has no <title> element.
    $title = $title->length > 0 ? $title->item(0)->nodeValue : "";
    $description = "";
    $keywords = "";
    $metas = $doc->getElementsByTagName("meta");
    for ($i = 0; $i < $metas->length; $i++)
    {
        $meta = $metas->item($i);
        // Lowercase the attribute value so "Description" and "description" both match.
        if (strtolower($meta->getAttribute("name")) == "description")
            $description = $meta->getAttribute("content");
        if (strtolower($meta->getAttribute("name")) == "keywords")
            $keywords = $meta->getAttribute("content");
    }
    return '{ "Title": "'.$title.'", "Description": "'.str_replace("\n", "", $description).'", "Keywords": "'.$keywords.'"}';
}
function follow_links($url)
{
    global $already_crawled;
    // Send a custom User-Agent with the request ("header" is the HTTP context option name).
    $context = stream_context_create(array('http' => array('method' => "GET", 'header' => "User-Agent: jordanBot\n")));
    $doc = new DOMDocument();
    @$doc->loadHTML(@file_get_contents($url, false, $context));
    // Parse the base URL once and reuse its parts when resolving links.
    $base = parse_url($url);
    $origin = $base["scheme"]."://".$base["host"];
    $linklist = $doc->getElementsByTagName("a");
    foreach ($linklist as $link)
    {
        $l = $link->getAttribute("href")."\n";
        if (substr($l, 0, 1) == "/" && substr($l, 0, 2) != "//")
        {
            // Root-relative link: /page
            $l = $origin.$l;
        }
        else if (substr($l, 0, 2) == "//")
        {
            // Protocol-relative link: //example.com/page
            $l = $base["scheme"].":".$l;
        }
        else if (substr($l, 0, 2) == "./")
        {
            // Link relative to the current directory: ./page
            $l = $origin.dirname($base["path"]).substr($l, 1);
        }
        else if (substr($l, 0, 1) == "#")
        {
            // Fragment on the current page: #section
            $l = $origin.$base["path"].$l;
        }
        else if (substr($l, 0, 3) == "../")
        {
            $l = $origin."/".$l;
        }
        else if (substr($l, 0, 11) == "javascript:")
        {
            // Skip javascript: pseudo-links; this must come before the generic fallback to be reachable.
            continue;
        }
        else if (substr($l, 0, 5) != "https" && substr($l, 0, 4) != "http")
        {
            // Any other relative link: page.html
            $l = $origin."/".$l;
        }
        if (!in_array($l, $already_crawled))
        {
            $already_crawled[] = $l;
            echo get_details($l)."\n";
            //echo $l."\n";
        }
    }
}
follow_links($start);
print_r($already_crawled);
One problem I'm running into is that for the google.com <a> link, I get the result { "Title": "", "Description": "", "Keywords": ""} instead of something like { "Title": "Google", "Description": "", "Keywords": ""}. If it helps, I'm following the howCode tutorial.
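
To narrow this down, here is a minimal sketch of a check I could run. It assumes the problem is in the URL string handed to get_details rather than in the title parsing, which I have not confirmed. follow_links appends "\n" to every href, so each URL it passes on ends with a newline. clean_href and title_from_html are hypothetical helper names for this test only; they are not part of the tutorial.

```php
<?php
// Hypothetical helper: strip surrounding whitespace (including a trailing "\n") from an href.
function clean_href($href)
{
    return trim($href);
}

// Hypothetical helper: extract the <title> text from an HTML string, or "" if there is none.
function title_from_html($html)
{
    if ($html === false || $html === "") {
        return ""; // the fetch failed or returned nothing
    }
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ hides warnings about malformed markup
    $titles = $doc->getElementsByTagName("title");
    return $titles->length > 0 ? $titles->item(0)->nodeValue : "";
}

// The URL as follow_links builds it, with the newline still attached.
$raw = "http://google.com\n";
var_dump($raw === clean_href($raw)); // bool(false): the two strings differ

// Parsing works on a known-good document, so an empty title points at the fetch.
echo title_from_html("<html><head><title>Google</title></head><body></body></html>"), "\n"; // Google
```

If fetching clean_href($l) returns a title while fetching $l does not, the trailing newline would be the likely culprit. That is only a guess until I test it against the real page.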