PHP cURL returns 400 Bad Request when run in a loop

Date: 2010-11-15 07:58:34

Tags: php curl

I'm trying to do some screen scraping with the cURL library.

I managed to scrape a handful of URLs (5-10) successfully.

However, whenever I run it in a for loop to scrape a larger batch (10-20) of URLs, it reaches a point where the last few URLs return "HTTP/1.1 400 Bad Request": "Your browser sent a request that this server could not understand. The number of request header fields exceeds this server's limit."

I'm fairly sure the URLs are correct and properly trimmed, and the header length is the same. If I move those last few URLs to the top of the list to be scraped, they do go through, but then the URLs at the end of the list get the 400 Bad Request error again. What could the problem be? What could cause this?

Any suggestions?

The code looks like this:


for ($i = 0; $i < sizeof($url); $i++)
    $data[$i] = $this->get($url[$i]);



    function get($url) {

        $this->headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
        $this->headers[] = 'Connection: Keep-Alive';
        $this->headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
        $this->user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 3.5.30729)';

        set_time_limit(EXECUTION_TIME_LIMIT);
        $default_exec_time = ini_get('max_execution_time');

        $this->redirectcount = 0;
        $process = curl_init($url);
        curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
        curl_setopt($process, CURLOPT_HEADER, 1);
        curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
        if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEFILE, $this->cookie_file);
        if ($this->cookies == TRUE) curl_setopt($process, CURLOPT_COOKIEJAR, $this->cookie_file);

        //off compression for debugging's sake
        //curl_setopt($process,CURLOPT_ENCODING , $this->compression);

        curl_setopt($process, CURLOPT_TIMEOUT, 180);
        if ($this->proxy) curl_setopt($process, CURLOPT_PROXY, $this->proxy);
        if ($this->proxyauth){ 
            curl_setopt($process, CURLOPT_HTTPPROXYTUNNEL, 1); 
            curl_setopt($process, CURLOPT_PROXYUSERPWD, $this->proxyauth);  
         }
        curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($process, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($process,CURLOPT_MAXREDIRS,10); 

        //added
        //curl_setopt($process, CURLOPT_AUTOREFERER, 1);
        curl_setopt($process,CURLOPT_VERBOSE,TRUE);
        if ($this->referrer) curl_setopt($process,CURLOPT_REFERER,$this->referrer);

        if($this->cookies){
            foreach($this->cookies as $cookie){
                curl_setopt ($process, CURLOPT_COOKIE, $cookie);
                //echo $cookie; 
            }
        }

        $return = $this->redirect_exec($process);//curl_exec($process) or curl_error($process);
        curl_close($process);
        set_time_limit($default_exec_time);//setback to default

        return $return;
    }

    function redirect_exec($ch, $curlopt_header = false) {

    //curl_setopt($ch, CURLOPT_HEADER, true);
    //curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec($ch);
    $file = fopen(DP_SCRAPE_DATA_CURL_DIR.$this->redirectcount.".html","w");
    fwrite($file,$data);
    fclose($file);

    $info = curl_getinfo($ch);
    print_r($info); echo "<br/>";

    $http_code = $info['http_code'];
    if ($http_code == 301 || $http_code == 302 || $http_code == 303) {
        //list($header) = explode("\r\n\r\n", $data);
        //print_r($header);
        $matches = array();
        //print_r($data);
        //Check if the response has a Location to redirect to
        preg_match('/(Location:|URI:)(.*?)\n/', $data, $matches);
        $url = trim(array_pop($matches));
        //print_r($url);
        $url_parsed = parse_url($url);
        //print_r($url_parsed);
        if (isset($url_parsed['path']) && isset($url) && !empty($url)) {
            //echo "<br/>".$url;
            curl_setopt($ch, CURLOPT_URL, MY_HOST.$url);
            //echo "<br/>".$url;
            $this->redirectcount++;
            return $this->redirect_exec($ch);
            //return $this->get(MY_HOST.$url);
            //$this->redirect_exec($ch);
        }
    } elseif ($http_code == 200) {
        $matches = array();
        //NOTE: the pattern below was truncated when the question was posted
        //(the HTML tag it matched was stripped out); it presumably extracted a
        //redirect URL (e.g. from a meta refresh tag) out of the page body.
        preg_match('/(/i', $data, $matches);
        //print_r($matches);
        $url = trim(array_pop($matches));
        //print_r($url);
        $url_parsed = parse_url($url);
        //print_r($url_parsed);
        if (isset($url_parsed['path']) && isset($url) && !empty($url)) {
            curl_setopt($ch, CURLOPT_URL, $url);
            //echo "<br/>".$url;
            $this->redirectcount++;
            sleep(SLEEP_INTERVAL);
            return $this->redirect_exec($ch);
            //return $this->get($url);
            //$this->redirect_exec($ch);
        }
    }
    //echo "data ".$data;
    $this->redirectcount++;
    return $data; // $info['url'];
    }

where $urls contains all the URLs, including the query strings for the GET requests.

From curl_getinfo() I noticed that [request_size] keeps getting larger and larger, when it should stay roughly the same size. How can I print/echo my HTTP request headers for debugging?

2 answers:

Answer 0 (score: 5)

The problem with the multiplying headers lies at the top of the get method:

$this->headers[] = 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
$this->headers[] = 'Connection: Keep-Alive';
$this->headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';

On every iteration you append the same headers to the object instance's headers array (since array[] appends to the array). You need to either reset the array on each iteration or move the header setup into another method.

If headers is only ever set inside the get method, you can change it to this to fix the problem:

$this->headers = array(
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg',
    'Connection: Keep-Alive',
    'Content-type: application/x-www-form-urlencoded;charset=UTF-8'
);

...but if the headers are always the same and never change between iterations, you might as well set their value in the object's constructor and only read them in the get method, since resetting the array to the same value every time is redundant.
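For illustration, a minimal sketch of that constructor approach (the class name and the trimmed-down get method here are hypothetical; adapt it to your own class, which has many more options set):

class Scraper {

    private $headers;
    private $user_agent;

    public function __construct() {
        // Set the request headers once, instead of appending to the array on every get() call.
        $this->headers = array(
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg',
            'Connection: Keep-Alive',
            'Content-type: application/x-www-form-urlencoded;charset=UTF-8'
        );
        $this->user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12 (.NET CLR 3.5.30729)';
    }

    public function get($url) {
        $process = curl_init($url);
        // Only read $this->headers here; never append to it, so repeated calls send identical headers.
        curl_setopt($process, CURLOPT_HTTPHEADER, $this->headers);
        curl_setopt($process, CURLOPT_USERAGENT, $this->user_agent);
        curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
        $data = curl_exec($process);
        curl_close($process);
        return $data;
    }
}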

Answer 1 (score: 0)

With CURLINFO_HEADER_OUT set to true, I was able to retrieve the request headers that were actually sent.
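A minimal sketch of that debugging technique, independent of the class above (the URL is a placeholder):

$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);   // record the outgoing request headers
curl_exec($ch);

$info = curl_getinfo($ch);
echo $info['request_header'];                  // the raw request that was actually sent
// Alternatively: echo curl_getinfo($ch, CURLINFO_HEADER_OUT);
curl_close($ch);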

And indeed, the request headers keep accumulating more and more data.

In particular, I have these headers repeating over and over:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg
Connection: Keep-Alive
Content-type: application/x-www-form-urlencoded;charset=UTF-8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg
Connection: Keep-Alive
Content-type: application/x-www-form-urlencoded;charset=UTF-8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg
Connection: Keep-Alive
Content-type: application/x-www-form-urlencoded;charset=UTF-8
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, image/gif, image/x-bitmap, image/jpeg, image/pjpeg
Connection: Keep-Alive
Content-type: application/x-www-form-urlencoded;charset=UTF-8