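I have the following code, which takes URLs and proxies from textarea fields, fetches the page source with cURL, extracts certain links from the page, and inserts them into a database. It worked for a single URL, but stopped working after I added proxies and two loops to handle multiple URLs and proxies. Now it just times out with no error message and says it cannot find the file. I get the proxies from proxy-list.org. Any pointers would be appreciated.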
<html>
<body>
<?php
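// Split the textarea input into lines, trimming stray "\r" and dropping blank entries
$urls=array_values(array_filter(array_map('trim', explode("\n", $_POST['url']))));
$proxies=array_values(array_filter(array_map('trim', explode("\n", $_POST['proxy']))));
$allurls=count($urls);
$allproxies=count($proxies);
// Use < rather than <= so the loops never read one element past the end of each array
for ( $counter = 0; $counter < $allurls; $counter++) {
for ( $count = 0; $count < $allproxies; $count++) {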
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$urls[$counter]);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 0);
curl_setopt($ch, CURLOPT_PROXY,$proxies[$count]);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST,'GET');
curl_setopt ($ch, CURLOPT_HEADER, 1);
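// Execute the request once; calling curl_exec() twice fetched every page twice
$curl_scraped_page=curl_exec($ch);
// curl_exec() returns false on failure (for example a dead proxy), so move on to the next proxy
if ($curl_scraped_page === false) {
curl_close($ch);
continue;
}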
//use the new tool box; require_once avoids redeclaring parseA1 on every loop iteration
require_once "ToolBoxA4.php";
//call the new function parseA1
$arrOut = parseA1 ($curl_scraped_page);
//the output is an array with 3 items: $arrOut[0] is RHS, $arrOut[1] is TOP, $arrOut[2] is NAT
//to look at the RHS
//$arrLookAt = explode(",", $arrOut[0]);
//print_r ($arrLookAt);
//echo "<br><hr><br>";
//foreach ($arrLookAt as $value){
// echo $value;
// echo "<br>";
//}
// mt_rand() already returns a non-negative integer within the platform's supported range
$FileName = mt_rand();
$FileHandle = fopen($FileName, 'w') or die("can't open file");
fwrite($FileHandle, $curl_scraped_page);
//$dom = new DOMDocument();
//@$dom->loadHTML($curl_scraped_page);
//$xpath = new DOMXPath($doc);
//$hrefs = $xpath->query('//a[@href][@id]');
$hostname="****";
$username="****";
$password="****";
$dbname="****";
$usertable="****";
$con=mysql_connect($hostname,$username, $password) or die ("<html><script language='JavaScript'>alert('Unable to connect to database! Please try again later.'),history.go(-1)</script></html>");
mysql_select_db($dbname ,$con);
//function storeLink($url) {
// $query = "INSERT INTO **** (time, ad1, ad2) VALUES ('$FileName','$url', '$gathered_from')";
// mysql_query($query) or die('Error, insert query failed');
//}
//for ($i = 0; $i < $hrefs->length; $i++) {
// $href = $hrefs->item($i);
// $url = $href->getAttribute('href');
// storeLink($url);
//
//}
//function storeLink($top, $right) {
//$query = "INSERT INTO happyturtle (time, ad1, ad2) VALUES ('$FileName','$top', '$right')";
//mysql_query($query) or die('Error, insert query failed');
$right = explode(",", $arrOut[0]);
$top = explode(",", $arrOut[1]);
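// Insert up to 6 TOP entries, stopping early if the page produced fewer
for ( $countforme = 0; $countforme <= 5 && isset($top[$countforme]); $countforme++) {
// Escape scraped text before placing it in the SQL string
$topnow=mysql_real_escape_string($top[$countforme]);
$query = "INSERT INTO **** (time, ad1) VALUES ('$FileName','$topnow')";
mysql_query($query) or die('Error, insert query failed');
}
// Insert up to 16 RHS entries, stopping early if the page produced fewer
for ( $countforme = 0; $countforme <= 15 && isset($right[$countforme]); $countforme++) {
$rightnow = mysql_real_escape_string($right[$countforme]);
$query = "INSERT INTO **** (time, ad1) VALUES ('$FileName','$rightnow')";
mysql_query($query) or die('Error, insert query failed');
}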
mysql_close($con);
fclose($FileHandle);
curl_close($ch);
//echo $FileName;
//echo "<br/>";
}
}
?>
</body>
</html>
Answer 0 (score: 0)
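Your code fetches each URL in turn, so it can take a long time to run. One possible solution is to use the cURL "multi" interface, which lets several requests run at the same time: http://www.php.net/manual/en/function.curl-multi-exec.php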
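As a rough illustration of the "multi" interface, here is a minimal sketch. The function name fetchAll() and its parameters are hypothetical and not part of the question's code; it assumes a single proxy for the whole batch and does no per-transfer error reporting.

```php
<?php
// Sketch only: fetchAll(), $proxy and $timeout are illustrative names.
// Fetches all URLs concurrently and returns the bodies keyed like $urls.
function fetchAll(array $urls, $proxy = null, $timeout = 30)
{
    $multi = curl_multi_init();
    $handles = array();

    // Create one easy handle per URL and attach it to the multi handle
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        if ($proxy !== null) {
            curl_setopt($ch, CURLOPT_PROXY, $proxy);
        }
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            // select can fail on some systems; pause briefly to avoid a busy loop
            usleep(100000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Collect the response bodies and release the handles
    $results = array();
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

With this shape, the total run time is bounded by the slowest single transfer rather than the sum of all transfers. A fuller version would probably call curl_multi_info_read() to see which transfers failed and why.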
Another option is to raise the PHP timeout on the server you are using (if this really is a batch process). Details are at http://php.net/manual/en/function.set-time-limit.php
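For example, something like the following at the top of the script would raise the limit for that one request. The value of 300 seconds is an arbitrary illustration, not a recommendation, and some hosts disable this function.

```php
<?php
// Restart the execution timer and allow up to 300 seconds (illustrative value).
// Passing 0 instead would remove the limit entirely.
$ok = set_time_limit(300);

if (!$ok) {
    // The host may forbid changing the limit; the script then keeps the server default
    error_log('Could not raise the PHP execution time limit');
}
```

Note that the web server may enforce its own, separate timeout on the request, which this call does not change.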
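One observation: public proxies (such as those from proxy-list.org) tend to respond very slowly. Because you are requesting through several of them, the script will always take at least as long as the slowest proxy takes to respond, which may well be longer than the server's PHP timeout setting.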
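One way to limit the damage from a slow proxy, going slightly beyond what is suggested above, is to give each request its own cURL timeouts so a single bad proxy cannot use up the whole script's time budget. The helper name proxyCurlOptions() and the default values below are assumptions made for illustration.

```php
<?php
// Hypothetical helper: builds cURL options with per-request timeouts.
// The defaults (10 s to connect, 30 s in total) are illustrative only.
function proxyCurlOptions($url, $proxy, $connectTimeout = 10, $totalTimeout = 30)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_PROXY          => $proxy,
        CURLOPT_RETURNTRANSFER => true,
        // Maximum seconds to wait while connecting to the proxy
        CURLOPT_CONNECTTIMEOUT => $connectTimeout,
        // Maximum seconds the whole transfer may take
        CURLOPT_TIMEOUT        => $totalTimeout,
    );
}
```

The options can then be applied with curl_setopt_array($ch, proxyCurlOptions($url, $proxy)) before calling curl_exec($ch), which returns false if a timeout is hit.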