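I can't get this code to work properly. I took it from a blog, and it is based on the WordPress link checker. I have about 6,000 URLs in a database whose HTTP status I need to check, so this seemed like a good fit. I've modified the code slightly to suit my needs, and it works (sort of).
I've checked the url_list array throughout the code, and it contains all the URLs. The problem is that the script basically stops executing after about row 110; the exact point is somewhat random, but it is usually around that number. I'm not sure whether I need to set a timeout somewhere or whether I have a bug in the code. I also noticed that setting $max_connections to anything greater than 8 returns a 500 error. Any suggestions?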
<?php
// CONFIG
$db_host = 'localhost';
$db_user = 'test';
$db_pass = 'yearight';
$db_name = 'URLS';
$excluded_domains = array();
$max_connections = 7;
$dbh = new PDO('mysql:host=localhost;dbname=URLS', $db_user, $db_pass);
$sth = $dbh->prepare("SELECT url FROM list");
$sth->execute();
$result = $sth->fetchAll(PDO::FETCH_COLUMN, 0);
// initialize some variables
$url_list = array();
$working_urls = array();
$dead_urls = array();
$not_found_urls = array();
$active = null;
foreach($result as $d) {
// get all links via regex
if (preg_match_all('@((http?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)@', $d, $matches)) {
foreach ($matches[1] as $url) {
// store the url
$url_list []= $url;
}
}
}
// 1. multi handle
$mh = curl_multi_init();
// 2. add multiple URLs to the multi handle
for ($i = 0; $i < $max_connections; $i++) {
add_url_to_multi_handle($mh, $url_list);
}
// 3. initial execution
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
// 4. main loop
while ($active && $mrc == CURLM_OK) {
// 5. there is activity
if (curl_multi_select($mh) != -1) {
// 6. do work
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
// 7. is there info?
if ($mhinfo = curl_multi_info_read($mh)) {
// this means one of the requests were finished
// 8. get the info on the curl handle
$chinfo = curl_getinfo($mhinfo['handle']);
// 9. dead link?
if (!$chinfo['http_code']) {
$dead_urls []= $chinfo['url'];
// 10. 404?
} else if ($chinfo['http_code'] == 404) {
$not_found_urls []= $chinfo['url'];
// 11. working
} else {
$working_urls []= $chinfo['url'];
}
// 12. remove the handle
curl_multi_remove_handle($mh, $mhinfo['handle']);
curl_close($mhinfo['handle']);
// 13. add a new url and do work
if (add_url_to_multi_handle($mh, $url_list)) {
do {
$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
}
}
}
}
// 14. finished
curl_multi_close($mh);
echo "==Dead URLs==<br/>";
echo implode("<br/>",$dead_urls) . "<br/><br/>";
echo "==404 URLs==<br>";
echo implode("<br/>",$not_found_urls) . "<br/><br/>";
echo "==Working URLs==<br/>";
echo implode("<br/>",$working_urls);
echo "<pre>";
var_dump($url_list);
echo "</pre>";
// 15. adds a url to the multi handle
function add_url_to_multi_handle($mh, $url_list) {
static $index = 0;
// if we have another url to get
if ($url_list[$index]) {
// new curl handle
$ch = curl_init();
// set the url
curl_setopt($ch, CURLOPT_URL, $url_list[$index]);
// to prevent the response from being outputted
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// follow redirections
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
// do not need the body. this saves bandwidth and time
curl_setopt($ch, CURLOPT_NOBODY, 1);
// add it to the multi handle
curl_multi_add_handle($mh, $ch);
// increment so next url is used next time
$index++;
return true;
} else {
// we are done adding new URLs
return false;
}
}
?>
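Update:
I wrote a script in bash that does the same thing. Going through the text file with its output, I noticed that the failures usually happen on links that return odd HTTP status codes such as 000 and 522. Some of those requests run for up to 5 minutes! So I'm wondering whether the PHP version of cURL stops executing when it encounters these status codes. It's just a thought, but it might help narrow down the problem.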
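For context on those codes: the curl command-line tool prints 000 when no HTTP response was received at all (curl_getinfo() reports this as 0 in PHP), and 522 is a non-standard status that Cloudflare uses when the connection to the origin server times out. One way to keep such hosts from stalling the run is to cap each transfer with cURL's own timeouts. The following is a minimal sketch, not a confirmed fix: make_head_handle() and classify_http_code() are hypothetical helpers, and the 5 and 15 second limits are illustrative assumptions rather than tuned values.

```php
<?php
// Hypothetical helper: build a body-less request handle with explicit timeouts.
// The default limits (5s to connect, 15s total) are assumptions for illustration.
function make_head_handle(string $url, int $connectTimeout = 5, int $totalTimeout = 15)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // do not echo the response
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);           // but not forever
    curl_setopt($ch, CURLOPT_NOBODY, true);           // headers only
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connectTimeout); // seconds to connect
    curl_setopt($ch, CURLOPT_TIMEOUT, $totalTimeout); // seconds for the whole transfer
    return $ch;
}

// Hypothetical helper: mirrors the three buckets used by the script above.
function classify_http_code(int $code): string
{
    if ($code === 0) {
        return 'dead';       // no HTTP response (DNS failure, timeout, refused)
    }
    if ($code === 404) {
        return 'not_found';
    }
    return 'working';        // everything else, including 5xx codes such as 522
}
```

With a helper like this, the curl_init() and curl_setopt() lines in add_url_to_multi_handle() could be replaced by a single call to make_head_handle($url_list[$index]). Note that, like the original script, this classification still files a 522 under "working", which may not be what is wanted.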
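Answer 0 (score: -1)
1 - This is an execution time issue.
2 - Raising the maximum execution time (max_execution_time) at the top of the code will certainly help:
bool set_time_limit(int $seconds)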
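A minimal sketch of this suggestion, assuming the PHP configuration allows the limit to be changed at runtime. set_time_limit() restarts the timeout counter from the moment it is called, and passing 0 removes the limit entirely. The raise_time_limit() wrapper and the value 600 are hypothetical and only for illustration.

```php
<?php
// Hypothetical wrapper: raise the execution time limit and report what is in effect.
function raise_time_limit(int $seconds): int
{
    // set_time_limit() returns false when the limit could not be changed
    if (!set_time_limit($seconds)) {
        error_log('Could not change max_execution_time');
    }
    // Read the setting back so the caller can see the effective value
    return (int) ini_get('max_execution_time');
}

// Call this once, before the cURL multi loop starts.
$limit = raise_time_limit(600);
echo "max_execution_time is now {$limit} seconds\n";
```

On non-Windows systems this limit counts only the script's own execution time, not time spent waiting on network operations, so a 500 error may also come from a separate web server or FastCGI timeout that this call does not affect.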