Question

我有一个网络机器人，它耗费了我的记忆力，经过一段时间，内存使用量达到50％，并且该过程被杀死;我不知道为什么内存使用会像这样增加，我没有包含“para.php”这是一个并行卷曲请求的库。我想了解更多有关网络抓取工具的信息，我搜索了很多，但找不到任何有用的文档或方法，我可以使用。

This is the library我从中获得了para.php。

我的代码：

require_once "para.php";

class crawling{

public $montent;


public function crawl_page($url){

    $m = new Mongo();

    $muun = $m->howto->en->findOne(array("_id" => $url));

    if (isset($muun)) {
        return;
    }

    $m->howto->en->save(array("_id" => $url));

    echo $url;

    echo "\n";

    $para = new ParallelCurl(10);

    $para->startRequest($url, array($this,'on_request_done'));

    $para->finishAllRequests();

    preg_match_all("(<a href=\"(.*)\")siU", $this->montent, $matk);

    foreach($matk[1] as $longu){
        $href = $longu;
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }


                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= $path;
            }
        }
        $this->crawl_page($longu);
    }
}

public function on_request_done($content) {
    $this->montent = $content;
}


$moj = new crawling;
$moj->crawl_page("http://www.example.com/");

Answer 1

您在1个网址上调用此crawl_page函数。获取内容（$ this-＆gt; montent）并检查链接（$ matk）。

虽然这些还没有销毁，但你会递归，开始对crawl_page的新调用。 $ this-＆gt;时刻将被新内容覆盖（没关系）。再往下一点，$ matk（一个新变量）填充了新的$ this-＆gt; montent的链接。此时，内存中有2个$ matk：一个包含您首先开始处理的文档的所有链接，另一个包含首次链接到原始文档中的文档的链接。

我建议找到所有链接＆amp;将它们保存到数据库（而不是立即递归）。然后只需清除数据库中的链接队列，逐个1（每个新文档向数据库添加一个新条目）

如何编写不占用大量内存的机器人？

1 个答案: