How do I make a simple crawler in PHP?

Asked: 2010-02-22 18:23:54

Tags: php web-crawler

I have a web page with a bunch of links on it. I want to write a script that dumps all the data contained in those links into a local file.

Has anybody done this in PHP? General guidelines and gotchas would suffice as an answer.

14 Answers:

Answer 0 (score: 89)

Meh. Don't parse HTML with regexes.

Here's a DOM version inspired by Tatu's:

<?php
function crawl_page($url, $depth = 5)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                // resolve against the directory of the current page's path;
                // guard against URLs that have no path component at all
                $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
                $href .= $dir . $path;
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:",$url,PHP_EOL,"CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL;
}
crawl_page("http://hobodave.com", 2);

EDIT: I fixed some bugs in Tatu's version (it works with relative URLs now).

EDIT: I added a new bit of functionality that prevents it from following the same URL twice.

EDIT: echoing output to STDOUT now, so you can redirect it to whatever file you want.

EDIT: Fixed a bug pointed out by George in his answer. Relative URLs will no longer be appended to the end of the URL path, but will overwrite it. Thanks to George for this. Note that George's answer doesn't account for any of: https, user, pass, or port. If you have the http PECL extension loaded, this is quite simply done using http_build_url. Otherwise, I have to manually glue the URL together using parse_url. Thanks again, George.
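
For illustration, here is a minimal sketch of what that http_build_url call does, assuming the http PECL (v1) extension is loaded; the expected output reflects my reading of its merge semantics, so treat it as a sketch rather than a guarantee:

<?php
// Sketch only: http_build_url() from the PECL http v1 extension merges the
// given parts into the base URL, so a relative path replaces the base path
// while scheme, user, pass, host and port are carried over.
$base = 'http://user:pass@example.com:8080/old/page.html';
$abs  = http_build_url($base, array('path' => '/new/page.html'));
echo $abs; // expected: http://user:pass@example.com:8080/new/page.html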

Answer 1 (score: 15)

Here is my implementation, based on the example/answer above. It is:

  1. Class-based
  2. Uses cURL
  3. Supports HTTP Auth
  4. Skips URLs that do not belong to the base domain
  5. Returns the HTTP response code for every page
  6. Returns the crawl time for every page

CRAWL CLASS:

    class crawler
    {
        protected $_url;
        protected $_depth;
        protected $_host;
        protected $_useHttpAuth = false;
        protected $_user;
        protected $_pass;
        protected $_seen = array();
        protected $_filter = array();
    
        public function __construct($url, $depth = 5)
        {
            $this->_url = $url;
            $this->_depth = $depth;
            $parse = parse_url($url);
            $this->_host = $parse['host'];
        }
    
        protected function _processAnchors($content, $url, $depth)
        {
            $dom = new DOMDocument('1.0');
            @$dom->loadHTML($content);
            $anchors = $dom->getElementsByTagName('a');
    
            foreach ($anchors as $element) {
                $href = $element->getAttribute('href');
                if (0 !== strpos($href, 'http')) {
                    $path = '/' . ltrim($href, '/');
                    if (extension_loaded('http')) {
                        $href = http_build_url($url, array('path' => $path));
                    } else {
                        $parts = parse_url($url);
                        $href = $parts['scheme'] . '://';
                        if (isset($parts['user']) && isset($parts['pass'])) {
                            $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                        }
                        $href .= $parts['host'];
                        if (isset($parts['port'])) {
                            $href .= ':' . $parts['port'];
                        }
                        $href .= $path;
                    }
                }
                // Crawl only links that belong to the start domain (enforced by isValid() inside crawl_page)
                $this->crawl_page($href, $depth - 1);
            }
        }
    
        protected function _getContent($url)
        {
            $handle = curl_init($url);
            if ($this->_useHttpAuth) {
                curl_setopt($handle, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
                curl_setopt($handle, CURLOPT_USERPWD, $this->_user . ":" . $this->_pass);
            }
        // following 302 redirects creates problems with authentication
    //        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);
            // return the content
            curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);
    
            /* Get the HTML or whatever is linked in $url. */
            $response = curl_exec($handle);
            // response total time
            $time = curl_getinfo($handle, CURLINFO_TOTAL_TIME);
            /* Check for 404 (file not found). */
            $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    
            curl_close($handle);
            return array($response, $httpCode, $time);
        }
    
        protected function _printResult($url, $depth, $httpcode, $time)
        {
            ob_end_flush();
            $currentDepth = $this->_depth - $depth;
            $count = count($this->_seen);
            echo "N::$count,CODE::$httpcode,TIME::$time,DEPTH::$currentDepth URL::$url <br>";
            ob_start();
            flush();
        }
    
        protected function isValid($url, $depth)
        {
            if (strpos($url, $this->_host) === false
                || $depth === 0
                || isset($this->_seen[$url])
            ) {
                return false;
            }
            foreach ($this->_filter as $excludePath) {
                if (strpos($url, $excludePath) !== false) {
                    return false;
                }
            }
            return true;
        }
    
        public function crawl_page($url, $depth)
        {
            if (!$this->isValid($url, $depth)) {
                return;
            }
            // add to the seen URL
            $this->_seen[$url] = true;
            // get Content and Return Code
            list($content, $httpcode, $time) = $this->_getContent($url);
            // print Result for current Page
            $this->_printResult($url, $depth, $httpcode, $time);
            // process subPages
            $this->_processAnchors($content, $url, $depth);
        }
    
        public function setHttpAuth($user, $pass)
        {
            $this->_useHttpAuth = true;
            $this->_user = $user;
            $this->_pass = $pass;
        }
    
        public function addFilterPath($path)
        {
            $this->_filter[] = $path;
        }
    
        public function run()
        {
            $this->crawl_page($this->_url, $this->_depth);
        }
    }
    

Usage:

    // USAGE
    $startURL = 'http://YOUR_URL/';
    $depth = 6;
    $username = 'YOURUSER';
    $password = 'YOURPASS';
    $crawler = new crawler($startURL, $depth);
    $crawler->setHttpAuth($username, $password);
    // Exclude paths matching the following structure from being processed
    $crawler->addFilterPath('customer/account/login/referer');
    $crawler->run();
    

Answer 2 (score: 11)

Check out PHP Crawler:

http://sourceforge.net/projects/php-crawler/

See if it helps.

Answer 3 (score: 9)

In its simplest form:

function crawl_page($url, $depth = 5) {
    if($depth > 0) {
        $html = file_get_contents($url);

        preg_match_all('~<a.*?href="(.*?)".*?>~', $html, $matches);

        foreach($matches[1] as $newurl) {
            crawl_page($newurl, $depth - 1);
        }

        // save the crawled page's URL and contents ($newurl here was a bug:
        // it only ever held the last link found on the page)
        file_put_contents('results.txt', $url."\n\n".$html."\n\n", FILE_APPEND);
    }
}

crawl_page('http://www.domain.com/index.php', 5);

The function gets the contents of a page, then crawls all the links it finds and saves the contents to 'results.txt'. It accepts a second parameter, depth, which defines how many levels of links should be followed. Pass 1 there if you only want to parse the links from the given page.

Answer 4 (score: 5)

Why use PHP for this, when you can use wget, e.g.

wget -r -l 1 http://www.example.com

For how to parse the contents, see Best Methods to parse HTML and use the search function for examples. How to parse HTML has been answered multiple times before.

Answer 5 (score: 5)

With some small changes to hobodave's code, here is a code snippet you can use to crawl pages. This needs the cURL extension to be enabled on your server.

<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5){
    static $seen = array(); // must be static, otherwise the visited check never fires across recursion
    if (($depth == 0) or in_array($url, $seen)) {
        return;
    }
    $seen[] = $url; // remember this URL so it is not crawled twice

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    if ($result) {
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);
        foreach ($matches as $match) {
            $href = $match[1];
            if (0 !== strpos($href, 'http')) {
                // resolve relative links against the current page's URL
                // (was $href, which is relative here and has no scheme/host)
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }
    echo "Crawled {$url}"; // was $href, which is undefined when a page has no links
}
crawl_page("http://www.sitename.com/", 3);
?>

I have explained this script in this crawler script tutorial.

Answer 6 (score: 3)

Hobodave, you were very close. The only thing I have changed is within the if statement that checks whether the href attribute of the found anchor tag begins with 'http'. Instead of simply prepending the $url variable (which contains the page that was passed in), you must first strip it down to the host, which can be done using the parse_url PHP function.

<?php
function crawl_page($url, $depth = 5)
{
  static $seen = array();
  if (isset($seen[$url]) || $depth === 0) {
    return;
  }

  $seen[$url] = true;

  $dom = new DOMDocument('1.0');
  @$dom->loadHTMLFile($url);

  $anchors = $dom->getElementsByTagName('a');
  foreach ($anchors as $element) {
    $href = $element->getAttribute('href');
    if (0 !== strpos($href, 'http')) {
       /* this is where I changed hobodave's code */
        $host = "http://".parse_url($url,PHP_URL_HOST);
        $href = $host. '/' . ltrim($href, '/');
    }
    crawl_page($href, $depth - 1);
  }

  echo "New Page:<br /> ";
  echo "URL:",$url,PHP_EOL,"<br />","CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL,"  <br /><br />";
}

crawl_page("http://hobodave.com/", 5);
?>

Answer 7 (score: 2)

As mentioned before, there are crawler frameworks out there ready to be customized, but if what you're doing is as simple as you mentioned, you can make it from scratch pretty easily.

Scraping the links: http://www.phpro.org/examples/Get-Links-With-DOM.html

Dumping the results to a file: http://www.tizag.com/phpT/filewrite.php
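
A minimal sketch combining those two steps (DOM for the links, an appending write for the dump); http://example.com/ is a placeholder start URL, and allow_url_fopen is assumed to be enabled:

<?php
// Sketch: load a page with DOM, append every href it contains to links.txt.
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://example.com/'); // placeholder URL

$links = '';
foreach ($dom->getElementsByTagName('a') as $a) {
    $links .= $a->getAttribute('href') . PHP_EOL;
}
file_put_contents('links.txt', $links, FILE_APPEND);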

Answer 8 (score: 1)

I used @hobodave's code, with this little tweak to prevent re-crawling all fragment variants of the same URL:

<?php
function crawl_page($url, $depth = 5)
{
  $parts = parse_url($url);
  if(array_key_exists('fragment', $parts)){
    unset($parts['fragment']);
    $url = http_build_url($parts);
  }

  static $seen = array();
  ...

Then you can also omit the $parts = parse_url($url); line inside the foreach loop.
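
If the http PECL extension that provides http_build_url isn't loaded, a minimal sketch of the same fragment-stripping idea (the helper name strip_fragment is my own):

<?php
// Sketch: drop everything from the first '#' onwards, no extension required.
function strip_fragment($url)
{
    $pos = strpos($url, '#');
    return ($pos === false) ? $url : substr($url, 0, $pos);
}

// e.g. strip_fragment('http://example.com/page#section') => 'http://example.com/page'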

Answer 9 (score: 1)

You can try this; it may be helpful for you:

$search_string = 'american golf News: Fowler beats stellar field in Abu Dhabi';
$html = file_get_contents('URL-OF-THE-SITE'); // placeholder: put the site URL here
$dom = new DOMDocument;
$titalDom = new DOMDocument;
$tmpTitalDom = new DOMDocument;
libxml_use_internal_errors(true);
@$dom->loadHTML($html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$videos = $xpath->query('//div[@class="primary-content"]');
foreach ($videos as $key => $video) {
    $newdomaindom = new DOMDocument;
    $newnode = $newdomaindom->importNode($video, true);
    $newdomaindom->appendChild($newnode);
    @$titalDom->loadHTML($newdomaindom->saveHTML());
    $xpath1 = new DOMXPath($titalDom);
    $titles = $xpath1->query('//div[@class="listingcontainer"]/div[@class="list"]');
    if (strcmp(preg_replace('!\s+!', ' ', $titles->item(0)->nodeValue), $search_string)) {
        $tmpNode = $tmpTitalDom->importNode($video, true);
        $tmpTitalDom->appendChild($tmpNode);
        break;
    }
}
echo $tmpTitalDom->saveHTML();

Answer 10 (score: 0)

I came up with the following spider code. I adapted it a bit from the following: PHP - Is there a safe way to perform deep recursion? It seems fairly fast...

<?php
function spider($base_url, $search_urls = array()) {
    $queue[] = $base_url;
    $done = array();
    $found_urls = array();
    while ($queue) {
        $link = array_shift($queue);
        if (!is_array($link)) {
            $done[] = $link;
            foreach ($search_urls as $s) {
                if (strstr($link, $s)) { $found_urls[] = $link; }
            }
            if (empty($search_urls)) { $found_urls[] = $link; }
            if (!empty($link)) {
                echo 'LINK:::'.$link;
                $content = file_get_contents($link);
                //echo 'P:::'.$content;
                preg_match_all('~<a.*?href="(.*?)".*?>~', $content, $sublink);
                if (!in_array($sublink, $done) && !in_array($sublink, $queue)) {
                    $queue[] = $sublink;
                }
            }
        } else {
            $result = array();
            $return = array();
            // flatten multi dimensional array of URLs to one dimensional.
            while (count($link)) {
                $value = array_shift($link);
                if (is_array($value)) {
                    foreach ($value as $sub) {
                        $link[] = $sub;
                    }
                } else {
                    $return[] = $value;
                }
            }
            // now loop over one dimensional array.
            foreach ($return as $link) {
                // echo 'L::'.$link;
                // url may be in form <a href.. so extract what's in the href bit.
                preg_match_all('/<a[^>]+href=([\'"])(?<href>.+?)\1[^>]*>/i', $link, $result);
                if (isset($result['href'][0])) { $link = $result['href'][0]; }
                // add the new URL to the queue.
                if ((!strstr($link, "http")) && (!in_array($base_url.$link, $done)) && (!in_array($base_url.$link, $queue))) {
                    $queue[] = $base_url.$link;
                } else {
                    if ((strstr($link, $base_url)) && (!in_array($base_url.$link, $done)) && (!in_array($base_url.$link, $queue))) {
                        $queue[] = $link;
                    }
                }
            }
        }
    }

    return $found_urls;
}

$base_url = 'https://www.houseofcheese.co.uk/';
$search_urls = array($base_url.'acatalog/');
$done = spider($base_url, $search_urls);

//
// RESULT
//
echo '<br /><br />';
echo 'RESULT:::';
foreach ($done as $r) {
    echo 'URL:::'.$r.'<br />';
}

Answer 11 (score: 0)

It's worth remembering that when crawling external links (I appreciate that the OP relates to a user's own page), you should be aware of robots.txt. I found the following, which will hopefully help: http://www.the-art-of-web.com/php/parse-robots/
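
As a rough illustration (separate from the linked article's parser), here is a minimal sketch that honours plain Disallow rules before fetching a page. It ignores User-agent groups, Allow lines and wildcards, so treat it as a starting point only:

<?php
// Minimal robots.txt sketch: collect every Disallow path and test the URL's
// path against them as simple prefixes. Ignores User-agent groups, Allow
// rules and wildcard patterns.
function robots_disallowed($url)
{
    $parts = parse_url($url);
    $robots = @file_get_contents($parts['scheme'] . '://' . $parts['host'] . '/robots.txt');
    if ($robots === false) {
        return false; // no robots.txt reachable: assume allowed
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    foreach (explode("\n", $robots) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m)
            && strpos($path, $m[1]) === 0) {
            return true;
        }
    }
    return false;
}

// e.g. inside a crawl loop: if (robots_disallowed($href)) { continue; }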

Answer 12 (score: 0)

I created a small class to grab data from a provided URL and then extract the HTML elements of your choice. The class makes use of cURL and DOMDocument.

PHP class:

class crawler {


   public static $timeout = 2;
   public static $agent   = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';


   public static function http_request($url) {
      $ch = curl_init();
      curl_setopt($ch, CURLOPT_URL,            $url);
      curl_setopt($ch, CURLOPT_USERAGENT,      self::$agent);
      curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, self::$timeout);
      curl_setopt($ch, CURLOPT_TIMEOUT,        self::$timeout);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      $response = curl_exec($ch);
      curl_close($ch);
      return $response;
   }


   public static function strip_whitespace($data) {
      $data = preg_replace('/\s+/', ' ', $data);
      return trim($data);
   }


   public static function extract_elements($tag, $data) {
      $response = array();
      $dom      = new DOMDocument;
      @$dom->loadHTML($data);
      foreach ( $dom->getElementsByTagName($tag) as $index => $element ) {
         $response[$index]['text'] = self::strip_whitespace($element->nodeValue);
         foreach ( $element->attributes as $attribute ) {
            $response[$index]['attributes'][strtolower($attribute->nodeName)] = self::strip_whitespace($attribute->nodeValue);
         }
      }
      return $response;
   }


}

Example usage:

$data  = crawler::http_request('https://stackoverflow.com/questions/2313107/how-do-i-make-a-simple-crawler-in-php');
$links = crawler::extract_elements('a', $data);
if ( count($links) > 0 ) {
   file_put_contents('links.json', json_encode($links, JSON_PRETTY_PRINT));
}

Example response:

[
    {
        "text": "Stack Overflow",
        "attributes": {
            "href": "https:\/\/stackoverflow.com",
            "class": "-logo js-gps-track",
            "data-gps-track": "top_nav.click({is_current:false, location:2, destination:8})"
        }
    },
    {
        "text": "Questions",
        "attributes": {
            "id": "nav-questions",
            "href": "\/questions",
            "class": "-link js-gps-track",
            "data-gps-track": "top_nav.click({is_current:true, location:2, destination:1})"
        }
    },
    {
        "text": "Developer Jobs",
        "attributes": {
            "id": "nav-jobs",
            "href": "\/jobs?med=site-ui&ref=jobs-tab",
            "class": "-link js-gps-track",
            "data-gps-track": "top_nav.click({is_current:false, location:2, destination:6})"
        }
    }
]

Answer 13 (score: 0)

This is an old question. A lot of good things have happened since then. Here are my two cents on this topic:

  1. To accurately track visited pages, you have to normalize the URI first (a starting-point sketch follows this list). The normalization algorithm includes multiple steps:

    • Sort query parameters. For example, the following URIs are equivalent after normalization: GET http://www.example.com/query?id=111&cat=222 GET http://www.example.com/query?cat=222&id=111
    • Convert an empty path. Example: http://example.org → http://example.org/

    • Capitalize percent-encodings. All letters within a percent-encoding triplet (e.g. "%3A") are case-insensitive. Example: http://example.org/a%c2%B1b → http://example.org/a%C2%B1b

    • Remove unnecessary dot-segments. Example: http://example.org/../a/b/../c/./d.html → http://example.org/a/c/d.html

    • Possibly some other normalization rules

  2. Not only does the <a> tag have an href attribute; the <area> tag has one too: https://html.com/tags/area/. If you don't want to miss anything, you have to scrape <area> tags as well.

  3. Track the crawl progress. If the website is small, it is not a problem. On the contrary, it can be very frustrating if you crawl half of the site and it then fails. Consider using a database or the file system to store the progress.

  4. Be kind to site owners. If you are ever going to use your crawler outside of your own website, you have to use delays. Without delays, the script is too fast and might significantly slow down some small sites. From the sysadmin's perspective, it looks like a DoS attack. A static delay between requests will do the trick.
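
As a starting point for item 1, here is a minimal normalization sketch. It covers query sorting, the empty-path and percent-encoding rules, and lowercases the scheme and host; dot-segment removal is omitted for brevity, and the function assumes an absolute http(s) URI:

<?php
// Minimal URI-normalization sketch (assumes an absolute http(s) URI):
// sorts query parameters, defaults an empty path to '/', uppercases
// percent-encoded triplets, lowercases scheme/host, drops fragments.
function normalize_uri($uri)
{
    $p = parse_url($uri);
    $query = '';
    if (isset($p['query'])) {
        parse_str($p['query'], $params);
        ksort($params);                      // sort query parameters by key
        $query = '?' . http_build_query($params);
    }
    $path = (isset($p['path']) && $p['path'] !== '') ? $p['path'] : '/';
    $path = preg_replace_callback('/%[0-9a-f]{2}/i', function ($m) {
        return strtoupper($m[0]);            // uppercase percent-encodings
    }, $path);
    return strtolower($p['scheme']) . '://' . strtolower($p['host'])
        . (isset($p['port']) ? ':' . $p['port'] : '')
        . $path . $query;                    // fragment is intentionally dropped
}

// The two query examples from above normalize to the same string:
// normalize_uri('http://www.example.com/query?id=111&cat=222')
//   === normalize_uri('http://www.example.com/query?cat=222&id=111')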

If you don't want to deal with all of this, give Crawlzone a try and let me know your feedback. Also check out the article I wrote a while ago: https://www.codementor.io/zstate/this-is-how-i-crawl-n98s6myxm