I have a crawler script that runs 4 bots at the same time, each bot opened in its own tab, and each one usually does the following:
1/ sets up the DB connection + the variables needed for the scraping
2/ gets the target URL from the DB
3/ fetches the content with cURL or file_get_contents
4/ sets $html with simple_html_dom (roughly as in the sketch below)
5/ includes an "engine" that scrapes and manipulates the content
6/ finally - checks that everything is OK, optimizes the content and stores it in the DB
7/ does this for X links. After X links it refreshes the page and continues the crawling process.
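Step 4 is the usual simple_html_dom call; a minimal sketch of it (illustrative only, the real code is part of what is reduced out of the listing below):

// sketch of step 4: turn the fetched page ($str) into a simple_html_dom object
include_once('simple_html_dom.php');
$html = str_get_html($str); // returns false on an empty or oversized page
if (!$html)
{
    // nothing to parse - handle as a failed/jammed link
}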
Everything worked like a charm! But recently, after a few minutes (not at the same time for every bot), all of the bots just stop (no error is shown); sometimes only 3 of them do...
There is a script that refreshes the page every Y minutes in case they get stuck; that gets my bots working again, but it is not an answer to this problem.
I checked the Apache error log, but it showed nothing unusual.
Do you have any ideas?
Reduced code (with comments):
ini_set('user_agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5');
error_reporting(E_ALL);
include ("peulot/stations1.php");//with connection and vars
include_once('simple_html_dom.php');
//DEFINE VALUES:
/*
here vars are declared and set
*/
echo "
<script language=javascript>
var int=self.setInterval(function(){ refresh2(); },".$protect.");
var counter;
function refresh2() {
geti();
link = 'store_url_beta.php?limit_link=".$limit_link."&storage_much=".$dowhile."&jammed=".($jammed_count+=1)."&bot=".$sbot."&counter=';
link = link+counter;
window.location=link;
}
function changecolor(answer)
{
document.getElementById(answer).style.backgroundColor = \"#00FF00\";
}
</script>";//this is the refresh if jammed
//some functions:
/*
function utf8_encode_deep --> for encoding
function hexbin --> for simhash fingerprint
function Charikar_SimHash --> for simhash fingerprint
function SimHashfingerprint --> for simhash fingerprint
*/
while ($i<=$dowhile)
{
    //final values after crawling:
    $link_insert="";
    $p_ele_insert="";
    $title_insert="";
    $alt_insert="";
    $h_insert="";
    $charset="";
    $text="";
    $result_key="";
    $result_desc="";
    $note="";
    ///this connection is to check that there are links to crawl in data base... + grab the line for crawl.
    $sql = "SELECT * FROM $table2 WHERE crawl='notyet' AND flag_avoid $regex $bot_action";
    $rs_result = mysql_query ($sql);
    $idr = mysql_fetch_array($rs_result);
    unset ($sql);
    unset ($rs_result);
    set_time_limit(0);
    $qwe++;
    $target_url = $idr['live_link'];//set the link we are about to crawl now.
    $matches_relate = $idr['relate'];//to insert at last
    $linkid = $idr['id'];//link id to mark it as crawled in the end
    $crawl_status = $idr['crawl'];//saving this to check if we update storage table or insert new row
    $bybot_status = $idr['by_bot'];//saving this to check if we update storage table or insert new row
    $status ="UPDATE $table2 SET crawl='working', by_bot='".$bot."', flag_avoid='$stat' WHERE id='$linkid'";
    if(!mysql_query($status)) die('problem15');
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5');
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $str = curl_exec($ch);
    curl_close($ch);
    if (strlen($str)<100)
    {
        //do it with file get content
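        // (a minimal sketch of this fallback, assuming $target_url from above;
        //  the stream-context options here are illustrative, not the original code)
        $ctx = stream_context_create(array('http' => array(
            'user_agent'      => 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.56 Safari/536.5',
            'follow_location' => 1,
            'timeout'         => 30,
        )));
        $str = @file_get_contents($target_url, false, $ctx);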
    }
    if (strlen($html)>500)
    {
        require("engine.php");//GENERATE FATAL ERROR IF CRAWLER ENGINE AND PARSER NOT AVAILABLE
        flush();//that will flush a result without any refresh
        usleep(300);
        //before inserting into table storage check if it was crawled before and then decide if to insert or update:
        if ($crawl_status=="notyet"&&$bybot_status=="notstored")
        {
            //insert values
        }
        else
        {
            //update values
        }
        flush();//that will flush a result without any refresh
        usleep(300);
        if ($qwe>=$refresh) //for page refresh call
        {
            $secounter++;//counter for session
            //optimize data
            echo "<script type='text/javascript'>function refresh() { window.location='store_url_beta.php?limit_link=".$limit_link."&counter=".$i."&secounter=".$secounter."&storage_much=".$dowhile."&jammed=".$jammed."&bot=".$sbot."'; } refresh(); </script>";
        }
    }//end of if html is no empty.
    else
    {//mark a flag @4 and write title jammed!
        //here - will update the table and note that its not possible to crawl
        if ($qwe>=$refresh)
        {
            $secounter++;//counter for session
            //optimize data
            echo "<script type='text/javascript'>function refresh() { window.location='store_url_beta.php?limit_link=".$limit_link."&counter=".$i."&secounter=".$secounter."&storage_much=".$dowhile."&jammed=".$jammed."&bot=".$sbot."'; } refresh(); </script>";
        }
    }//end of else cant grab nothing
    unset($html);
}//end of do while
mysql_close();
echo "<script language=javascript> window.clearInterval(int); </script>";
Edit
After endless testing and the logging approach (as Jack suggested) I found nothing!
The only thing that shows up in the Apache log when the bots stop is:
[Thu Oct 25 01:01:33 2012] [error] [client 127.0.0.1] File does not exist: C:/wamp/www/favicon.ico
zend_mm_heap corrupted
[Thu Oct 25 01:01:51 2012] [notice] Parent: child process exited with status 1 -- Restarting.
[Thu Oct 25 01:01:51 2012] [notice] Apache/2.2.22 (Win64) mod_ssl/2.2.22 OpenSSL/1.0.1c PHP/5.3.13 configured -- resuming normal operations
[Thu Oct 25 01:01:51 2012] [notice] Server built: May 13 2012 19:41:17
[Thu Oct 25 01:01:51 2012] [notice] Parent: Created child process 736
[Thu Oct 25 01:01:51 2012] [warn] Init: Session Cache is not configured [hint: SSLSessionCache]
[Thu Oct 25 01:01:51 2012] [notice] Child 736: Child process is running
[Thu Oct 25 01:01:51 2012] [notice] Child 736: Acquired the start mutex.
[Thu Oct 25 01:01:51 2012] [notice] Child 736: Starting 200 worker threads.
[Thu Oct 25 01:01:51 2012] [notice] Child 736: Starting thread to listen on port 80.
[Thu Oct 25 01:01:51 2012] [notice] Child 736: Starting thread to listen on port 80.
[Thu Oct 25 01:01:51 2012] [error] [client 127.0.0.1] File does not exist: C:/wamp/www/favicon.ico
This line is a mystery to me and I really don't know what to do about it, please help!
[Thu Oct 25 01:01:51 2012] [notice] Parent: child process exited with status 1 -- Restarting.
Answer 0 (score: 0):
Finding the cause of problems like these usually comes down to plain old logging.
You should have each worker write entries to its own log file before and after every potentially long operation, including debug messages, line numbers, memory usage and anything else you need to know; let it get stuck a few times and analyze the logs.
If there is a pattern (i.e. the logs stop showing data at the same point), you can narrow your search down; if not, you are probably dealing with a memory problem or some other fatal crash.
It also helps to retrace anything you may have changed in your setup recently, even if it seemed unrelated.
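For example, a minimal per-worker logger could look something like this (bot_log(), the log-file naming and the use of $bot / $target_url / $linkid are assumptions to adapt to your own script):

// minimal per-worker logger (sketch): one file per bot, written on every call
function bot_log($bot, $message)
{
    $line = sprintf("[%s] bot %s | mem %.1f MB (peak %.1f MB) | %s\n",
        date('Y-m-d H:i:s'),
        $bot,
        memory_get_usage(true) / 1048576,
        memory_get_peak_usage(true) / 1048576,
        $message);
    file_put_contents("bot_" . $bot . ".log", $line, FILE_APPEND | LOCK_EX);
}

// called before and after every potentially long step in the loop, e.g.:
bot_log($bot, "line " . __LINE__ . ": before curl of " . $target_url);
// ... curl_exec(), engine.php, the INSERT/UPDATE ...
bot_log($bot, "line " . __LINE__ . ": after store, link id " . $linkid);

The last entry written before a worker dies then tells you which step it never came back from, and the memory columns show whether usage was climbing toward a limit.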