PHP - Web爬虫不打印标题或描述

时间:2016-12-28 23:37:44

标签: php web-crawler

我在PHP中制作一个非常基本的网络爬虫作为项目,而我等待CS50,这是我到目前为止所做的。

<?php

$start = "http://localhost/~jordanbaron/Web%20Crawler/input.html";

$already_crawled = array();

function get_details($url)
{
  global $already_crawled;

  $doc = new DOMDocument();
  @$doc->loadHTML(@file_get_contents($url, false, stream_context_create(array('http'=>array('method'=> "GET", 'headers'=>"User-Agent: jordanBot\n")))));

  $title = $doc->getElementsByTagName("title");
  $title = $title->item(0)->nodeValue;

  $description = "";
  $keywords = "";
  $metas = $doc->getElementsByTagName("meta");

  for ($i = 0; $i < $metas->length; $i++)
  {
    $meta = $metas->item($i);

    if ($meta->getAttribute("name") == strtolower("description"))
      $description = $meta->getAttribute("content");
    if ($meta->getAttribute("name") == strtolower("keywords"))
      $keywords = $meta->getAttribute("content");
  }
  return '{ "Title": "'.$title.'", "Description": "'.str_replace("\n", "", $description).'", "Keywords": "'.$keywords.'"}';
}

function follow_links($url)
{

  global $already_crawled;

  $doc = new DOMDocument();
  @$doc->loadHTML(@file_get_contents($url, false, stream_context_create(array('http'=>array('method'=> "GET", 'headers'=>"User-Agent: jordanBot\n")))));

  $linklist = $doc->getElementsByTagName("a");

  foreach ($linklist as $link)
  {
    $l = $link->getAttribute("href")."\n";


    if (substr($l, 0, 1) == "/" && substr($l, 0, 2) != "//")
    {
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].$l;
    }
    else if (substr($l, 0, 2) == "//")
    {
      $l = parse_url($url)["scheme"].":".$l;

    }
    else if (substr($l, 0, 2) == "./")
    {
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].dirname(parse_url($url)["path"]).substr($l, 1);
    }
    else if (substr($l, 0, 1) == "#")
    {
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"].parse_url($url)["path"].$l;
    }
    else if (substr($l, 0, 3) == "../")
    {
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"]."/".$l;
    }
    else if (substr($l, 0, 5) != "https" && substr($l, 0, 4) != "http")
    {
      $l = parse_url($url)["scheme"]."://".parse_url($url)["host"]."/".$l;
    }
    else if (substr($s, 0, 11) == "javascript:")
    {
      continue;
    }

    if (!in_array($l, $already_crawled))
    {
      $already_crawled[] = $l;
      echo get_details($l)."\n";
      //echo $l."\n";
    }


  }
}

follow_links($start);

print_r($already_crawled);

我遇到的一个问题是,对于google.com <a>代码,我得到的结果为{ "Title": "", "Description": "", "Keywords": ""},而不是像{ "Title": "Google", "Description": "", "Keywords": ""}这样的内容如果有帮助,我会关注{ {3}} howCode教程

0 个答案:

没有答案