cURL does not process all requests - stops at around 110

Asked: 2014-08-21 14:14:42

Tags: php curl

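I can't get this code to work properly. I took it from a blog, where it was written as a WordPress link checker. I have about 6,000 URLs in a database and need to check the HTTP status of each one, so this seemed like a good fit. I modified the code slightly to suit my needs, and it works (sort of).

I checked the $url_list array in the code and it contains all the URLs. The problem is that execution basically stops after roughly the 110th URL; it is somewhat random, but usually around that number. I'm not sure whether I need to set a timeout somewhere or whether there is a bug in the code. I also noticed that if I set $max_connections to anything greater than 8, a 500 error is returned. Any suggestions?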

<?php
// CONFIG
$db_host = 'localhost';
$db_user = 'test';
$db_pass = 'yearight';
$db_name = 'URLS';
$excluded_domains = array();
$max_connections = 7;

// build the DSN from the config values above instead of hard-coding them
$dbh = new PDO("mysql:host=$db_host;dbname=$db_name", $db_user, $db_pass);

$sth = $dbh->prepare("SELECT url FROM list");
$sth->execute();

$result = $sth->fetchAll(PDO::FETCH_COLUMN, 0);

// initialize some variables
$url_list = array();
$working_urls = array();
$dead_urls = array();
$not_found_urls = array();
$active = null;

foreach($result as $d) {
    // get all links via regex
    // "https?" so that both http:// and https:// schemes are captured
    if (preg_match_all('@((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)@', $d, $matches)) {

        foreach ($matches[1] as $url) {

            // store the url
            $url_list []= $url;
        }
    }
}

// 1. multi handle
$mh = curl_multi_init();

// 2. add multiple URLs to the multi handle
for ($i = 0; $i < $max_connections; $i++) {
    add_url_to_multi_handle($mh, $url_list);
}

// 3. initial execution
do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

// 4. main loop
while ($active && $mrc == CURLM_OK) {

    // 5. there is activity
    if (curl_multi_select($mh) != -1) {

        // 6. do work
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);

        // 7. is there info?
        if ($mhinfo = curl_multi_info_read($mh)) {
            // this means one of the requests has finished

            // 8. get the info on the curl handle
            $chinfo = curl_getinfo($mhinfo['handle']);

            // 9. dead link?
            if (!$chinfo['http_code']) {
                $dead_urls []= $chinfo['url'];

            // 10. 404?
            } else if ($chinfo['http_code'] == 404) {
                $not_found_urls []= $chinfo['url'];

            // 11. working
            } else {
                $working_urls []= $chinfo['url'];
            }

            // 12. remove the handle
            curl_multi_remove_handle($mh, $mhinfo['handle']);
            curl_close($mhinfo['handle']);

            // 13. add a new url and do work
            if (add_url_to_multi_handle($mh, $url_list)) {

                do {
                    $mrc = curl_multi_exec($mh, $active);
                } while ($mrc == CURLM_CALL_MULTI_PERFORM);
            }
        }
    }
}

// 14. finished
curl_multi_close($mh);

echo "==Dead URLs==<br/>";
echo implode("<br/>",$dead_urls) . "<br/><br/>";

echo "==404 URLs==<br>";
echo implode("<br/>",$not_found_urls) . "<br/><br/>";

echo "==Working URLs==<br/>";
echo implode("<br/>",$working_urls);

echo "<pre>";
var_dump($url_list); 
echo "</pre>";
// 15. adds a url to the multi handle
function add_url_to_multi_handle($mh, $url_list) {
    static $index = 0;

    // if we have another url to get
    // isset() avoids an undefined-offset notice once the list is exhausted
    if (isset($url_list[$index])) {

        // new curl handle
        $ch = curl_init();

        // set the url
        curl_setopt($ch, CURLOPT_URL, $url_list[$index]);
        // to prevent the response from being outputted
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        // follow redirections
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        // do not need the body. this saves bandwidth and time
        curl_setopt($ch, CURLOPT_NOBODY, 1);

        // add it to the multi handle
        curl_multi_add_handle($mh, $ch);


        // increment so next url is used next time
        $index++;

        return true;
    } else {

        // we are done adding new URLs
        return false;
    }
}
?>
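
The main loop above reads at most one finished transfer per pass and does nothing when curl_multi_select() returns -1, which on some PHP/libcurl builds can leave the loop spinning. The following is only a sketch of how a rolling window could be restructured, not the blog's original method: check_urls() is a hypothetical name, results are keyed by the original URL via CURLOPT_PRIVATE (so duplicate URLs collapse into one entry), and per-handle timeouts are left out to keep it short.

```php
<?php
// Hypothetical helper: checks URLs with at most $max_connections in flight.
// Returns an array of original URL => HTTP status code (0 = no response).
function check_urls(array $urls, int $max_connections = 7): array
{
    $results = [];
    $queue   = array_values($urls);
    $next    = 0;
    $mh      = curl_multi_init();

    // Adds the next queued URL; returns false once the queue is exhausted.
    $add = function () use (&$next, $queue, $mh): bool {
        if (!isset($queue[$next])) {
            return false;
        }
        $url = $queue[$next++];
        $ch  = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL            => $url,
            CURLOPT_PRIVATE        => $url,  // remember the original URL
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_NOBODY         => true,
        ]);
        curl_multi_add_handle($mh, $ch);
        return true;
    };

    // Fill the initial window.
    $in_flight = 0;
    while ($in_flight < $max_connections && $add()) {
        $in_flight++;
    }

    $active = 0;
    while ($in_flight > 0) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc === CURLM_CALL_MULTI_PERFORM);
        if ($mrc !== CURLM_OK) {
            break;
        }

        // Drain every finished transfer, not just the first one.
        $added = false;
        while ($info = curl_multi_info_read($mh)) {
            $ch  = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$url] = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $in_flight--;
            if ($add()) {
                $in_flight++;
                $added = true;
            }
        }
        if ($added) {
            continue; // start the new transfers before waiting
        }

        // Wait for activity; back off briefly if select reports -1.
        if ($in_flight > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(100000);
        }
    }

    curl_multi_close($mh);
    return $results;
}
```

The loop is driven by a count of handles still in flight rather than by $active alone, so results that are already queued when the last transfer completes are still read.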

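Update:

I wrote a bash script that does the same thing. Printing the output as it works through the text file, I noticed that when it fails, it is usually on links that return odd HTTP status codes such as 000 or 522, and some of those tend to take up to 5 minutes to execute! So I wonder whether the PHP version of cURL stops executing when it encounters these status codes. It's just a thought, but it might add some value in solving the problem.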
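
In the command-line curl tool, a reported code of 000 generally means that no HTTP response was received at all; in PHP the corresponding http_code from curl_getinfo() is 0. Requests that hang for minutes are usually bounded with explicit per-handle timeouts. The sketch below assumes hypothetical helper names (make_checked_handle, classify_http_code) and illustrative timeout values of 10 and 30 seconds; the classification mirrors the three buckets used in the question's code.

```php
<?php
// Hypothetical helper: builds a HEAD-style handle with explicit timeouts.
// The default timeout values are illustrative assumptions, not tuned numbers.
function make_checked_handle(string $url, int $connect_timeout = 10, int $total_timeout = 30)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);   // avoid endless redirect chains
    curl_setopt($ch, CURLOPT_NOBODY, true);   // headers only
    // Give up if the connection cannot be established in time.
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connect_timeout);
    // Give up if the whole transfer takes longer than this.
    curl_setopt($ch, CURLOPT_TIMEOUT, $total_timeout);
    return $ch;
}

// Hypothetical helper: maps a status code to the buckets used in the question.
function classify_http_code(int $http_code): string
{
    if ($http_code === 0) {
        return 'dead';       // no response (timeout, DNS failure, refused)
    }
    if ($http_code === 404) {
        return 'not_found';
    }
    return 'working';        // everything else, including codes such as 522
}
```

With a timeout in place, a host that never answers ends up with an http_code of 0 and lands in the 'dead' bucket instead of holding a connection slot for minutes.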

1 answer:

Answer 0: (score: -1)

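1 - This is an execution time problem.

2 - Setting the maximum execution time (the max_execution_time limit) at the top of the code will help confirm this:

bool set_time_limit(int $seconds)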
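
As a minimal sketch of that suggestion, the limit could be raised at the very top of the script, before any requests are made. The helper name raise_time_limit and the 600-second value are assumptions for illustration. According to the PHP manual, on non-Windows systems time spent outside script execution (system calls, stream operations, database queries) is not counted toward this limit, so raising it may not help if the time is being lost elsewhere.

```php
<?php
// Hypothetical helper: raises the script's execution time limit.
function raise_time_limit(int $seconds): bool
{
    // set_time_limit() restarts the timeout counter from zero and
    // returns true on success, false on failure.
    return set_time_limit($seconds);
}

// 600 seconds is an assumed value; 0 would remove the limit entirely.
$ok = raise_time_limit(600);
echo $ok ? "limit raised\n" : "could not change limit\n";
```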