How can I make this web crawler more efficient?

Asked: 2017-06-20 16:40:42

Tags: php performance web-scraping web-crawler

I built this web scraper:

https://github.com/shoutweb/WebsiteCrawlerEmailExtractor

// Regular expression function that scans an individual page for emails
function get_emails_from_webpage($url)
{
    $text = file_get_contents($url);
    $res = preg_match_all("/[a-z0-9]+[_a-z0-9\.-]*[a-z0-9]+@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})/i", $text, $matches);
    if ($res) {
        return array_unique($matches[0]);
    } else {
        return null;
    }
}

//URL Array
$URLArray = array();

// Inputted URL; right now it just pulls from a GET variable, but you can alter this any way you want
$inputtedURL = $_GET['url'];


//Crawling the inputted domain to get the URLS
$urlContent = file_get_contents("http://".urldecode($inputtedURL));
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

$scrapedEmails = array();

for($i = 0; $i < $hrefs->length; $i++){
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $url = filter_var($url, FILTER_SANITIZE_URL);
    // Validate the URL and keep only links on the inputted domain
    if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
        if (strpos($url, $inputtedURL) !== false) {
            array_push($URLArray, $url);
        }
    }
}

//Extracting the emails from URLS that were crawled
foreach ($URLArray as $key => $url) {
    $emails = get_emails_from_webpage($url);

    if($emails != null){
      foreach($emails as $email) {
          if(!in_array($email, $scrapedEmails)){
            array_push($scrapedEmails,$email);
        }
      }
    } 
}


// Outputting the scraped emails in addition to the number of URLs crawled
foreach($scrapedEmails as $value) {
    echo $value . " " . count($URLArray);
}
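A side note on the deduplication loop earlier: `in_array()` scans the whole `$scrapedEmails` array on every check, which is O(n) per email. A common PHP idiom is to collect emails as array keys instead, so each lookup is a constant-time `isset()`/overwrite. A minimal sketch (the `dedupe_emails` helper and its sample input are made up for illustration):

```php
<?php
// Key-based deduplication: array keys are unique by construction,
// so assigning a duplicate key simply overwrites instead of
// requiring a linear in_array() scan.
function dedupe_emails(array $batches): array
{
    $seen = [];
    foreach ($batches as $emails) {
        foreach ($emails as $email) {
            $seen[$email] = true; // duplicates overwrite, order of first sight is kept
        }
    }
    return array_keys($seen);
}

// Hypothetical sample data standing in for per-page results
$batches = [
    ['a@example.com', 'b@example.com'],
    ['b@example.com', 'c@example.com'],
];
print_r(dedupe_emails($batches));
```

For a few dozen emails the difference is negligible, but it keeps the extraction loop linear as the crawl grows.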

It basically goes to the domain you enter, fetches all of its pages, and checks them for email addresses.

Each domain can take up to 30 seconds to crawl. I want to see whether there is a way to speed this web crawler up. One idea I had was to limit it to just the contact pages, but I couldn't find a clever way to do that.
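On limiting the crawl to contact-style pages: one simple heuristic is to reorder the URL list so links whose address mentions a contact-like keyword are fetched first, then stop early once emails turn up. A sketch of the idea (the function name and keyword list are made up, not part of the original code):

```php
<?php
// Move likely contact pages to the front of the crawl queue.
// The keyword list is a guess; tune it for the sites you target.
function prioritize_contact_pages(array $urls): array
{
    $keywords = ['contact', 'about', 'impressum'];
    $likely = [];
    $rest = [];
    foreach ($urls as $url) {
        $isLikely = false;
        foreach ($keywords as $kw) {
            if (stripos($url, $kw) !== false) { // case-insensitive substring match
                $isLikely = true;
                break;
            }
        }
        if ($isLikely) {
            $likely[] = $url;
        } else {
            $rest[] = $url;
        }
    }
    return array_merge($likely, $rest);
}

print_r(prioritize_contact_pages([
    'http://example.com/blog',
    'http://example.com/contact-us',
    'http://example.com/',
]));
```

Combined with an early exit from the extraction loop once enough emails are found, this avoids fetching most of the site in the common case.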

1 Answer:

Answer 0 (score: 0)

Assuming your intentions aren't evil:

As mentioned in the comments, one way to achieve this is to run the crawler in parallel (multiple worker processes), rather than one domain at a time.

Something like:

exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');
exec('php crawler.php > /dev/null 2>&1 &');

On the server you can set up a cron job that runs this automatically, so you don't have to start it by hand.
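Besides spawning separate processes, a large win inside a single PHP process is fetching the pages concurrently instead of making one blocking `file_get_contents()` call per URL, since most of those 30 seconds are spent waiting on the network. PHP's bundled `curl_multi_*` API can drive many transfers at once; a sketch with minimal error handling (the `fetch_all` helper is illustrative, not from the original code):

```php
<?php
// Fetch several URLs concurrently with curl_multi.
// Returns an array mapping url => response body (empty/false on failure).
function fetch_all(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi); // wait for socket activity instead of busy-looping
        }
    } while ($active && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $results;
}
```

Each returned body can then be run through the same email regex; total wall time becomes roughly the slowest single request rather than the sum of all of them.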