PHPCrawl - Attempted to call method "getURIContent" on class "PHPCrawlerUtils"

Posted: 2015-03-17 15:35:30

Tags: symfony phpcrawl

I am trying to use PHPCrawl with Symfony2. I first installed the PHPCrawl library with Composer, then created a "DependencyInjection" folder in my bundle where I placed a class "MyCrawler" that extends PHPCrawler, and registered it as a service. Now, when I start the crawling process, Symfony gives me the following error:

Attempted to call method "getURIContent" on class "PHPCrawlerUtils"

I cannot figure out why, since both the class and the method exist.
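For context, here is a minimal sketch of how the service might be declared in config.yml (the service id "my_crawler" matches the controller call below; the exact file location and the absence of constructor arguments are assumptions):

# app/config/config.yml (path assumed)
services:
    my_crawler:
        class: AppBundle\DependencyInjection\MyCrawler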

Here is my controller action:

/**
 * Crawls the target site
 *
 * @Route("/crawl", name="blog_crawl")
 * @Template()
 */
public function crawlAction($url = 'http://urlexample.net')
{               
    // Instead of instantiating the MyCrawler class directly, I fetch it as a service (defined in config.yml)
    $crawl = $this->get('my_crawler');

    $crawl->setURL($url);

    // Check the document's content-type header; only pages of type text/html are allowed
    $crawl->addContentTypeReceiveRule("#text/html#"); 

    // Filter the URLs found on the page - here we keep only HTML pages (skip images, stylesheets, scripts, PDFs)
    $crawl->addURLFilterRule("#(jpg|gif|png|pdf|jpeg|svg|css|js)$# i"); 

    $crawl->enableCookieHandling(TRUE);

    // Sets a limit to the number of bytes the crawler should receive altogether during the crawling process.
    $crawl->setTrafficLimit(0);

    // Sets a limit to the total number of requests the crawler should execute.
    $crawl->setRequestLimit(20);

    // Sets the content-size-limit for content the crawler should receive from documents.
    $crawl->setContentSizeLimit(0);

    // Sets the timeout in seconds for waiting for data on an established server-connection.
    $crawl->setStreamTimeout(20);

    // Sets the timeout in seconds for connection tries to hosting webservers.
    $crawl->setConnectionTimeout(20);

    $crawl->obeyRobotsTxt(TRUE);
    $crawl->setUserAgentString("Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0");

    $crawl->go();

    // At the end, after the process is finished, we print a short 
    // report (see method getProcessReport() for more information) 
    $report = $crawl->getProcessReport(); 

    echo "Summary:".'<br/>'; 
    echo "Links followed: ".$report->links_followed.'<br/>'; 
    echo "Documents received: ".$report->files_received.'<br/>'; 
    echo "Bytes received: ".$report->bytes_received." bytes".'<br/>'; 
    echo "Process runtime: ".$report->process_runtime." sec".'<br/>';
    echo "Abort reason: ".$report->abort_reason.'<br/>';


    return array(
        'varstuff' => 'something'
    );
}

And here is the MyCrawler service class in the DependencyInjection folder:

<?php

namespace AppBundle\DependencyInjection;

use PHPCrawler;
use PHPCrawlerDocumentInfo;

/**
 * Description of MyCrawler
 *
 * @author Norman
 */
class MyCrawler extends PHPCrawler{

    /**
     * Handles the document info for each crawled URL
     * 
     * @param PHPCrawlerDocumentInfo $pageInfo
     */
    public function handleDocumentInfo(PHPCrawlerDocumentInfo $pageInfo)
    {                
        $page_url = $pageInfo->url;        
        $source = $pageInfo->source;
        $status = $pageInfo->http_status_code;

        // If the page returned 200 OK and is not empty, print its URL
        if($status == 200 && $source!=''){
            echo $page_url.'<br/>';

            flush();            
        }
    }    
}

I also asked for help on the PHPCrawl SourceForge forum, but with no success so far. I should add that I am using PHPCrawl 0.83 from here:

https://github.com/mmerian/phpcrawl/

Here is the class where the problem seems to occur:

<?php
/**
 * Class for parsing robots.txt-files.
 *
 * @package phpcrawl
 * @internal
 */  
class PHPCrawlerRobotsTxtParser
{ 
  public function __construct()
  {
    // Init PageRequest-class
    if (!class_exists("PHPCrawlerHTTPRequest"))    include_once($classpath."/PHPCrawlerHTTPRequest.class.php");
    $this->PageRequest = new PHPCrawlerHTTPRequest();

  }

  /**
   * Parses a robots.txt-file and returns regular-expression-rules corresponding to the containing "disallow"-rules
   * that are adressed to the given user-agent.
   *
   * @param PHPCrawlerURLDescriptor $BaseUrl           The root-URL all rules from the robots-txt-file should relate to
   * @param string                  $user_agent_string The useragent all rules from the robots-txt-file should relate to
   * @param string                  $robots_txt_uri    Optional. The location of the robots.txt-file as URI.
   *                                                   If not set, the default robots.txt-file for the given BaseUrl gets parsed.
   *
   * @return array Numeric array containing regular-expressions for each "disallow"-rule defined in the robots.txt-file
   *               that's adressed to the given user-agent.
   */
  public function parseRobotsTxt(PHPCrawlerURLDescriptor $BaseUrl,   $user_agent_string, $robots_txt_uri = null)
  {
    PHPCrawlerBenchmark::start("processing_robotstxt");

    // If robots_txt_uri not given, use the default one for the given BaseUrl
    if ($robots_txt_uri === null)
      $robots_txt_uri = self::getRobotsTxtURL($BaseUrl->url_rebuild);

    // Get robots.txt-content
    $robots_txt_content = PHPCrawlerUtils::getURIContent($robots_txt_uri, $user_agent_string);

    $non_follow_reg_exps = array();

    // If content was found
    if ($robots_txt_content != null)
    {
      // Get all lines in the robots.txt-content that are adressed to our user-agent.
      $applying_lines = $this->getUserAgentLines($robots_txt_content, $user_agent_string);

      // Get valid reg-expressions for the given disallow-pathes.
      $non_follow_reg_exps = $this->buildRegExpressions($applying_lines, PHPCrawlerUtils::getRootUrl($BaseUrl->url_rebuild));
    }

    PHPCrawlerBenchmark::stop("processing_robots.txt");

    return $non_follow_reg_exps;
}

1 Answer:

Answer 0 (score: 0)

OK, I think I solved my own problem. What happens is that, when installed under Symfony2, the mmerian PHPCrawl package autoloads every class in the libs directory. There are two classes named PHPCrawlerUtils: the first sits in its own folder, while the second one is missing the getURIContent method. Once autoloading is done, the second one takes precedence.

In the main PHPCrawler class, the constructor loads each of the correct classes it needs only "if the class does not already exist", which is why the correct class was never loaded. In the end, I simply included the correct PHPCrawlerUtils class unconditionally.
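As an illustration only, here is a minimal sketch of that workaround applied to the MyCrawler class file; the vendor path is an assumption and depends on where Composer placed the mmerian/phpcrawl package:

<?php

namespace AppBundle\DependencyInjection;

// Force the correct PHPCrawlerUtils to be loaded before the autoloader can
// pick up the stripped-down copy that lacks getURIContent().
// NOTE: the path below is an assumption; adjust it to your vendor layout.
require_once __DIR__.'/../../../vendor/mmerian/phpcrawl/libs/PHPCrawlerUtils.class.php';

use PHPCrawler;
use PHPCrawlerDocumentInfo;

class MyCrawler extends PHPCrawler
{
    public function handleDocumentInfo(PHPCrawlerDocumentInfo $pageInfo)
    {
        // ... same implementation as shown above ...
    }
}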