Question

I wrote a very simple PHP crawler, but it has a memory leak problem. The code is:

<?php
require_once 'db.php';

$homepage = 'https://example.com';
$query = "SELECT * FROM `crawled_urls`";
$response = @mysqli_query($dbc, $query);

$already_crawled = [];
$crawling = [];

while($row = mysqli_fetch_array($response)){
  $already_crawled[] = $row['crawled_url'];
  $crawling[] = $row['crawled_url'];
}

function follow_links($url){
  global $already_crawled;
  global $crawling;
  global $dbc;

  $doc = new DOMDocument();
  $doc->loadHTML(file_get_contents($url));

  $linklist = $doc->getElementsByTagName('a');

  foreach ($linklist as $link) {
    $l = $link->getAttribute("href");
    $full_link = 'https://example.com'.$l;

    if (!in_array($full_link, $already_crawled)) {

      // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.

      $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
      $stmt = mysqli_prepare($dbc, $query);
      mysqli_stmt_execute($stmt);

      echo $full_link.PHP_EOL;
    }
  }

  array_shift($crawling);

  foreach ($crawling as $link) {
    follow_links($link);
  }
}

follow_links($homepage);

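Could you help me and share a way to avoid this huge memory leak? When I start the process everything works fine, but memory usage climbs steadily to 100%.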

Answer 1

Unset $doc as soon as you no longer need it:

function follow_links($url){
  global $already_crawled;
  global $crawling;
  global $dbc;

  $doc = new DOMDocument();
  $doc->loadHTML(file_get_contents($url));

  $linklist = $doc->getElementsByTagName('a');

  unset($doc);

  foreach ($linklist as $link) {
    $l = $link->getAttribute("href");
    $full_link = 'https://example.com'.$l;

    if (!in_array($full_link, $already_crawled)) {

      // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.

      $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
      $stmt = mysqli_prepare($dbc, $query);
      mysqli_stmt_execute($stmt);

      echo $full_link.PHP_EOL;
    }
  }

  array_shift($crawling);

  foreach ($crawling as $link) {
    follow_links($link);
  }
}

follow_links($homepage);

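Explanation: you are using recursion, which means you are essentially using the function call stack. If the stack is 20 calls deep, the resources of all 20 function invocations stay allocated at the same time. The deeper you go, the more memory you use. $doc is the main problem, but you may also want to look at how the other variables are used and make sure nothing unnecessary is still allocated when the function is called again.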
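
One thing to watch for: as far as I understand PHP's DOM extension, a DOMNodeList keeps its underlying document alive, so unset($doc) alone may not release the page while $linklist still exists. A minimal sketch that sidesteps this by copying the hrefs into a plain array inside a helper, so that both the document and the node list go out of scope when the helper returns. The function name extract_links and the $base parameter are my own, not part of the original code:

```php
<?php
/**
 * Hypothetical helper: return the href values of all <a> elements in $html,
 * each prefixed with $base (the same naive joining the question uses).
 */
function extract_links(string $html, string $base): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string on PHP 8
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed; silence the parser warnings.
    @$doc->loadHTML($html);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if ($href !== '') {
            $links[] = $base . $href;
        }
    }

    // Only plain strings leave this function; $doc and the node list
    // are local and can be freed as soon as we return.
    return $links;
}
```

The caller then only ever holds an array of strings, so no DOM objects are kept alive across recursive calls.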

Answer 2

Try unsetting the variable $doc before the function calls itself again:

function follow_links($url){
  global $already_crawled;
  global $crawling;
  global $dbc;

  $doc = new DOMDocument();
  $doc->loadHTML(file_get_contents($url));

  $linklist = $doc->getElementsByTagName('a');

  foreach ($linklist as $link) {
    $l = $link->getAttribute("href");
    $full_link = 'https://example.com'.$l;

    if (!in_array($full_link, $already_crawled)) {

      // TODO: Fetch data from the crawled url and store it in the DB. Check if it was already stored.

      $query = 'INSERT INTO `crawled_urls`(`id`, `crawled_url`) VALUES (NULL,\'' . $full_link . '\')';
      $stmt = mysqli_prepare($dbc, $query);
      mysqli_stmt_execute($stmt);

      echo $full_link.PHP_EOL;
    }
  }

  array_shift($crawling);
  unset($doc);

  foreach ($crawling as $link) {
    follow_links($link);
  }
}

Answer 3

The main problem with your code is that you are using recursion. That way you keep the old pages in memory even though you no longer need them.

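Try to remove the recursion. That should be relatively easy, since you are already using a list to store the links. I would, however, prefer a single list and represent the URLs as objects.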
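
A rough sketch of what the non-recursive version could look like, using one queue of pending URLs plus a set of URLs already seen. This is only one possible shape, not a drop-in replacement: crawl, $linksOf and $maxPages are names I made up, and $linksOf stands for whatever code fetches a page and returns its links as an array of strings.

```php
<?php
/**
 * Breadth-first crawl without recursion.
 * $linksOf is a hypothetical callable: it takes a URL and returns the
 * absolute links found on that page as an array of strings.
 */
function crawl(string $start, callable $linksOf, int $maxPages = 1000): array
{
    $queue = new SplQueue();      // URLs still waiting to be visited
    $queue->enqueue($start);
    $seen = [$start => true];     // URLs used as keys for fast lookups
    $visited = 0;

    while (!$queue->isEmpty() && $visited < $maxPages) {
        $url = $queue->dequeue();
        $visited++;

        foreach ($linksOf($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue->enqueue($link);
            }
        }
        // Nothing from this page survives the iteration except new URLs.
    }

    return array_keys($seen);
}
```

Because there is only a loop, the call stack stays flat no matter how many pages are visited. Keeping the seen URLs as array keys also makes the isset() check cheap, whereas in_array() scans the whole array on every call.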

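A few other things:

You appear to have an SQL injection vulnerability, so learn how to use prepared statements properly
Avoid global variables (you could have the function return the list of links instead)
If you plan to run this code against other people's websites, make sure you respect robots.txt, limit your crawl rate, and do not crawl pages more than once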
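
On the first point: the original code does call mysqli_prepare, but it still concatenates the URL into the SQL string, which defeats the purpose. A minimal sketch with a bound parameter instead. The function name store_url is made up, and I am assuming that the `id` column is AUTO_INCREMENT so it can simply be left out of the INSERT:

```php
<?php
/**
 * Hypothetical helper: insert one crawled URL using a bound parameter.
 * Assumes `crawled_urls`.`id` is AUTO_INCREMENT (an assumption, the
 * original code inserted NULL for it).
 */
function store_url(mysqli $dbc, string $url): bool
{
    // The ? placeholder keeps the URL out of the SQL text entirely.
    $stmt = mysqli_prepare($dbc, 'INSERT INTO `crawled_urls` (`crawled_url`) VALUES (?)');
    if ($stmt === false) {
        return false;
    }

    mysqli_stmt_bind_param($stmt, 's', $url); // 's' = bind as string
    $ok = mysqli_stmt_execute($stmt);
    mysqli_stmt_close($stmt);                 // release the statement handle

    return $ok;
}
```

Passing $dbc in as an argument also removes one of the globals.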

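If you want to use this code for anything other than learning, I recommend using a library. That is much easier than writing a crawler from scratch.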

PHP crawler uses up all of the server's memory