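How to scrape an SSL or HTTPS URL

Time: 2015-07-01 13:25:25

Tags: php curl web-scraping

I have written a function that scrapes a website using cURL, but when I call it, it returns nothing and I cannot work out why. The output is empty.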

  <?php
    function scrape($url)
    {
        $headers = Array(
                    "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
                    "Cache-Control: max-age=0",
                    "Connection: keep-alive",
                    "Keep-Alive: 300",
                    "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
                    "Accept-Language: en-us,en;q=0.5",
                    "Pragma: "
                );
        $config = Array(
                        CURLOPT_RETURNTRANSFER => TRUE ,
                        CURLOPT_FOLLOWLOCATION => TRUE ,
                        CURLOPT_AUTOREFERER => TRUE ,
                        CURLOPT_CONNECTTIMEOUT => 120 ,
                        CURLOPT_TIMEOUT => 120 ,
                        CURLOPT_MAXREDIRS => 10 ,                   
                        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8" ,
                        CURLOPT_URL => $url ,
                       ) ;
        $handle = curl_init() ;
        curl_setopt_array($handle,$config) ;
        curl_setopt($handle,CURLOPT_HTTPHEADER,$headers) ;
        $data = curl_exec($handle) ;
        curl_close($handle) ;
        return $data ;
    }

    echo scrape("https://www.google.com") ;
?>

1 answer:

Answer 0 (score: 5)

There are two possible fixes when trying to scrape an SSL or HTTPS URL:

  1. The quick fix
  2. The proper fix

The quick fix first.

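Warning: this can introduce the security problems that SSL is designed to protect against. With peer verification disabled, cURL no longer checks that the server's certificate is signed by a trusted CA.

Set: CURLOPT_SSL_VERIFYPEER => false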
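As a minimal sketch of the quick fix, the option can be added to the same `$config` array that the question passes to `curl_setopt_array`. The helper name `withoutPeerVerification` is hypothetical and exists only for illustration; this is meant for local debugging, not for production use.

```php
<?php
// Hypothetical helper: returns a copy of $config with peer verification disabled.
// Debugging only: this removes the protection against man-in-the-middle attacks.
function withoutPeerVerification(array $config)
{
    // Assign explicitly so that an existing VERIFYPEER entry is overridden.
    $config[CURLOPT_SSL_VERIFYPEER] = false;
    return $config;
}

$config = withoutPeerVerification(array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_URL            => "https://www.google.com",
));
```

The resulting array is passed to `curl_setopt_array($handle, $config)` exactly as in the question.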

Second, the proper fix. Set three options:

  1. CURLOPT_SSL_VERIFYPEER => true
  2. CURLOPT_SSL_VERIFYHOST => 2
  3. CURLOPT_CAINFO => getcwd() . '\CAcert.pem'

The last thing you need to do is download the CA certificate bundle.

Go to http://curl.haxx.se/docs/caextract.html, click 'cacert.pem', copy and paste the text into a text editor, and save the file as 'CAcert.pem'. Check that it was not saved as 'CAcert.pem.txt'.
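Because a wrong path or a stray `.txt` extension makes the request fail, it can help to check the bundle before handing it to cURL. The sketch below assumes the file sits in a directory you pass in; `caBundlePath` is a hypothetical helper name, and `DIRECTORY_SEPARATOR` is used here instead of the hard-coded backslash so the same code would also work outside Windows.

```php
<?php
// Hypothetical helper: build the CA bundle path and fail early if it is unusable.
function caBundlePath($directory, $filename = 'CAcert.pem')
{
    // DIRECTORY_SEPARATOR avoids hard-coding '\' (Windows) or '/' (Unix).
    $path = rtrim($directory, '/\\') . DIRECTORY_SEPARATOR . $filename;

    if (!is_readable($path)) {
        // Catch the common mistake of the editor saving "CAcert.pem.txt".
        $hint = is_file($path . '.txt') ? ' (found ' . $filename . '.txt; rename it)' : '';
        throw new RuntimeException('CA bundle not readable: ' . $path . $hint);
    }

    return $path;
}

// Usage, assuming the bundle was saved in the current working directory:
// $config[CURLOPT_CAINFO] = caBundlePath(getcwd());
```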

      <?php
          function scrape($url)
          {
              $headers = Array(
                          "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
                          "Cache-Control: max-age=0",
                          "Connection: keep-alive",
                          "Keep-Alive: 300",
                          "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
                          "Accept-Language: en-us,en;q=0.5",
                          "Pragma: "
                      );
              $config = Array(
                              CURLOPT_SSL_VERIFYPEER => true,
                              CURLOPT_SSL_VERIFYHOST => 2,
                              CURLOPT_CAINFO => getcwd() . '\CAcert.pem',
                              CURLOPT_RETURNTRANSFER => TRUE ,
                              CURLOPT_FOLLOWLOCATION => TRUE ,
                              CURLOPT_AUTOREFERER => TRUE ,
                              CURLOPT_CONNECTTIMEOUT => 120 ,
                              CURLOPT_TIMEOUT => 120 ,
                              CURLOPT_MAXREDIRS => 10 ,                   
                              CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8" ,
                              CURLOPT_URL => $url
                             ) ;
              $handle = curl_init() ;
              curl_setopt_array($handle,$config) ;
              curl_setopt($handle,CURLOPT_HTTPHEADER,$headers) ;
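              // Create the result object explicitly; assigning a property on an undefined variable is an error in PHP 8.
              $output = new stdClass() ;
              // Run the request once and reuse the result for the error check.
              $output->data = curl_exec($handle) ;
      
              if($output->data === false) {
                  $output->error = 'Curl error: ' . curl_error($handle);
              } else {
                  $output->error = 'Operation completed without any errors';
              }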
      
              curl_close($handle) ;
              return $output ;
          }
      
      $scrape = scrape("https://www.google.com") ;
      
      echo $scrape->data;
      
      //uncomment for errors
      //echo $scrape->error;
      ?>
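When the output is empty, as in the question, the cURL error number and the HTTP status can narrow down the cause alongside the error message. The following is a sketch only: `execWithDiagnostics` is a hypothetical helper, and it assumes the handle was configured with `CURLOPT_RETURNTRANSFER`. A certificate verification failure typically shows up as a non-zero error number with an "SSL certificate problem" message.

```php
<?php
// Hypothetical helper: run a prepared cURL handle and collect diagnostics.
function execWithDiagnostics($handle)
{
    $body = curl_exec($handle);

    return array(
        'ok'     => $body !== false,
        'body'   => is_string($body) ? $body : '',
        'errno'  => curl_errno($handle),  // 0 means no transport-level error
        'error'  => curl_error($handle),  // empty string when errno is 0
        'status' => curl_getinfo($handle, CURLINFO_HTTP_CODE), // 0 if no HTTP response arrived
    );
}

// Usage, with $handle configured as in scrape() above:
// $result = execWithDiagnostics($handle);
// if (!$result['ok']) {
//     echo 'cURL error ' . $result['errno'] . ': ' . $result['error'];
// }
```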