Question

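I'm trying to scrape a website's pages for specific text content. New pages are added all the time, so I want to be able to increment through the pages (using a URL with a fixed format) until I get a 404.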

The pages use the following format:

http://thesite.com/page-1.html

http://thesite.com/page-2.html

http://thesite.com/page-3.html

...etc...

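Everything works fine until it reaches page 36, and then it just dies (it doesn't even reach the 404 test case). I know there are about 100 pages in this example, and I can view them manually without any problems. Also, page 36 itself has no errors.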

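Test case - I tried looping over http://google.com 50 times and had no problems with the cURL recursion. So it seems to be either the website I actually want to scrape, or my server.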

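This looks like some kind of limit on the remote server or on my server, because I can run this page over and over with no delay between runs, and it always reads 36 pages before it dies.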

Can a remote server set a limit on cURL requests? Are there other timeouts I need to increase? Could this be a server memory issue?

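**Recursive scraping function:** (The $curl object is created the first time the method is called, then passed by reference. I read that this is better than creating and closing lots of cURL objects.)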

    function scrapeSite(&$curl,$preURL,$postURL,$parameters,$currentPage){
        //Format URL
        $formattedURL = $preURL.$currentPage.$postURL;
        echo "Formatted URL: ".$formattedURL."<br>";
        echo "Count: ".$currentPage."<br>";
        //Create CURL Object
        curl_setopt($curl, CURLOPT_URL, $formattedURL);

        //Set PHP Timeout
        set_time_limit(0);// to infinity for example
        //Check for 404
        $httpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        if($httpCode == 404 || $currentPage == 50) {
            curl_close($curl);
            return 'PAGE NOT FOUND<br>';
        }
        //Set other CURL Options
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_CONNECTTIMEOUT ,0); 
        curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
        curl_setopt($curl, CURLOPT_TIMEOUT, 400); //timeout in seconds
        $content = curl_exec($curl);
        $html = str_get_html($content);
        echo "Parameter Check: ".is_array($html->find($parameters))."<br>";
        if(is_array($html->find($parameters))>0){
            foreach($html->find($parameters) as $element) {
                echo "Text: ".$element->plaintext."<br>";
            }
            return scrapeSite($curl,$preURL,$postURL,$parameters,$currentPage+1);
        }else{
            echo "No Elements Found";
        }
    }
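
For comparison, here is a minimal sketch of the same crawl written as a loop instead of recursion. It is not the original code: `scrapePages`, `makeCurlFetcher` and the `$fetch` callable are hypothetical names introduced only for illustration. The one behavioural difference worth noting is that the status code is read after `curl_exec`, since `curl_getinfo` reports on the most recent completed transfer; called before `curl_exec`, it describes the previous page rather than the current one.

```php
<?php
// Hypothetical helper: walk page-1, page-2, ... until a 404 or a failed request.
// $fetch is an assumed callable: fn(string $url): array [int $httpCode, string|false $body].
function scrapePages(callable $fetch, string $preURL, string $postURL, int $maxPages = 100): array
{
    $bodies = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        [$httpCode, $body] = $fetch($preURL . $page . $postURL);
        if ($httpCode === 404 || $body === false) {
            break; // stop at the first missing page or transport failure
        }
        $bodies[$page] = $body;
    }
    return $bodies;
}

// Hypothetical cURL-backed fetcher that reuses a single handle, as in the question.
function makeCurlFetcher(): callable
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);

    return function (string $url) use ($curl): array {
        curl_setopt($curl, CURLOPT_URL, $url);
        $body = curl_exec($curl);
        // Read the status only after the transfer has actually run.
        $httpCode = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
        return [$httpCode, $body];
    };
}
```

It could be called as `scrapePages(makeCurlFetcher(), 'http://thesite.com/page-', '.html')`, with the HTML parsing applied to each returned body afterwards. Passing the fetcher in as a callable also makes it possible to try the loop against a fake fetcher without touching the network.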

Answer 1

Maybe it's just a memory limit issue. Try this (at the top of your script):

ini_set("memory_limit",-1);
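
Before removing the limit entirely, it may help to confirm that memory really is what grows from page to page. The sketch below is only an illustration: `memoryReport` is a hypothetical helper, and the idea is to call it once per page inside the scraping loop and watch whether the numbers climb towards the configured limit.

```php
<?php
// Hypothetical helper: one log line with current usage, peak usage and the configured limit.
function memoryReport(int $page): string
{
    return sprintf(
        'page %d: current=%.1f MB peak=%.1f MB limit=%s',
        $page,
        memory_get_usage(true) / 1048576,      // memory currently allocated from the system
        memory_get_peak_usage(true) / 1048576, // highest allocation so far in this request
        ini_get('memory_limit')                // e.g. "128M", or "-1" for no limit
    );
}

// Example: print the report for the first page.
echo memoryReport(1), PHP_EOL;
```

If the peak rises steadily with each page, a finite but larger value such as `ini_set("memory_limit", "256M")` is a more cautious alternative to `-1`, which disables the limit altogether.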

You also said "...or my server"; if you can, just read your logs...
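
If the server logs are not easy to get at, a rough alternative is to make the script report its own fatal errors. This is only a sketch under that assumption: `describeLastError` is a hypothetical helper, and the shutdown handler simply writes whatever `error_get_last()` holds when the script stops, which is where a message such as an exhausted memory limit would show up.

```php
<?php
// Surface every error while debugging (not something to leave on in production).
error_reporting(E_ALL);
ini_set('display_errors', '1');
ini_set('log_errors', '1');

// Hypothetical helper: turn the array from error_get_last() into one readable line.
function describeLastError(?array $error): string
{
    if ($error === null) {
        return 'no error recorded';
    }
    return sprintf(
        'type=%d message="%s" at %s:%d',
        $error['type'],
        $error['message'],
        $error['file'],
        $error['line']
    );
}

// Runs when the script ends, including after a fatal error.
register_shutdown_function(function (): void {
    $error = error_get_last();
    if ($error !== null) {
        error_log('Script stopped: ' . describeLastError($error));
    }
});
```

With this at the top of the script, the reason it stops at page 36 should appear either in the browser output or in PHP's error log.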

PHP cURL - Why does the script die after the 36th request to a remote URL?
