How do I download a copy of a website's HTML?

Time: 2012-12-28 02:17:00

Tags: php mysql curl file-get-contents

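How can I download a copy of the HTML from websites that use language detection and redirects (e.g. Google, YouTube)? I have already tried file_get_contents, but it is too limited.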

I am trying to fetch the HTML of www.google.com with cURL in PHP, but the site detects that I am coming from the UK and sends me a 302 redirect to www.google.co.uk.

I have tried many different things with no joy. Is this possible? Sites like www.markosweb.com do exactly this.
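One way to see exactly what the server is doing is to switch off CURLOPT_FOLLOWLOCATION, switch on CURLOPT_HEADER, and read the Location header of the 302 response yourself. The sketch below is only an illustration: find_redirect_target() is a hypothetical helper, and the sample header text is made up to resemble the redirect described above.

```php
<?php
// Hypothetical helper (not part of the question): returns the value of the
// first Location header found in a raw block of response headers, or null.
function find_redirect_target($rawHeaders)
{
    foreach (preg_split('/\r\n|\n|\r/', $rawHeaders) as $line) {
        // Header names are case-insensitive, hence the /i flag.
        if (preg_match('/^Location:\s*(.+)$/i', $line, $m)) {
            return trim($m[1]);
        }
    }
    return null;
}

// Assumed sample input, shaped like the 302 response described above.
$sample = "HTTP/1.1 302 Found\r\n"
        . "Location: http://www.google.co.uk/\r\n"
        . "Content-Length: 0\r\n";

$target = find_redirect_target($sample);
```

With a real request, the raw headers would come from curl_exec() with CURLOPT_HEADER set to true and CURLOPT_FOLLOWLOCATION set to false.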

My code:

$ch  = curl_init( "http://www.google.com/" );    

// $userAgent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)";
//  $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';


$header = array(
         "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
         "Accept-Language: en-US,us;q=0.7,en-us;q=0.5,en;q=0.3",
         "Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7",
         "Keep-Alive: 300");

curl_setopt($ch,CURLOPT_RETURNTRANSFER,TRUE); //TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,5); //The number of seconds to wait while trying to connect. 

curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); //The contents of the "User-Agent: " header to be used in a HTTP request.
curl_setopt($ch, CURLOPT_FAILONERROR, TRUE); //To fail silently if the HTTP code returned is greater than or equal to 400.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); //To follow any "Location: " header that the server sends as part of the HTTP header.
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE); //To automatically set the Referer: field in requests where it follows a Location: redirect.
curl_setopt($ch, CURLOPT_TIMEOUT, 10); //The maximum number of seconds to allow cURL functions to execute.  
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);

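curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/"); //Set the Referer on the same handle ($ch) used for every other option.

curl_setopt($ch, CURLOPT_HTTPHEADER, $header); //Send the custom request headers defined in $header above.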

$content = curl_exec( $ch );    
$err     = curl_errno( $ch );    
$errmsg  = curl_error( $ch );    
$header  = curl_getinfo( $ch );    
curl_close( $ch );    
$header['errno']   = $err;    
$header['errmsg']  = $errmsg;    
$header['content'] = $content;    
return $header;

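I have tried changing the user agent to many different values, and I have tried with and without the header details. When I sent the header "Accept-Language: ru-ru,ru;q=0.7,en-us;q=0.5,en;q=0.3" I did manage to get something back, but it was in Russian or something similar.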
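Since the Russian header changed the language of the result, the same mechanism can be tried with English preferences. Below is a minimal sketch of building such a header; build_accept_language() is a hypothetical helper and the weights are arbitrary example values. Whether the server honours the header is up to the server, and a country redirect may be decided from the IP address rather than from this header.

```php
<?php
// Hypothetical helper: turns an ordered map of language tag => weight
// into a single Accept-Language request header line.
function build_accept_language(array $prefs)
{
    $parts = array();
    foreach ($prefs as $tag => $q) {
        // A weight of 1 is the default and is conventionally omitted.
        // %F formats the float without regard to the current locale.
        $parts[] = ($q >= 1) ? $tag : sprintf('%s;q=%.1F', $tag, $q);
    }
    return 'Accept-Language: ' . implode(',', $parts);
}

$line = build_accept_language(array('en-US' => 1, 'en' => 0.8));

// Usage with an existing cURL handle $ch (not executed here):
// curl_setopt($ch, CURLOPT_HTTPHEADER, array($line));
```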

Thanks for your help. Carl

1 Answer:

Answer 0: (score: 1)

Try this proxy script:

// Change these configuration options if needed, see above descriptions for info.
$enable_jsonp    = false;
$enable_native   = false;
$valid_url_regex = '/.*/';

// ############################################################################

$url = isset( $_GET['url'] ) ? $_GET['url'] : '';

if ( !$url ) {

  // Passed url not specified.
  $contents = 'ERROR: url not specified';
  $status = array( 'http_code' => 'ERROR' );

} else if ( !preg_match( $valid_url_regex, $url ) ) {

  // Passed url doesn't match $valid_url_regex.
  $contents = 'ERROR: invalid url';
  $status = array( 'http_code' => 'ERROR' );

} else {
  $ch = curl_init( $url );

  if ( strtolower($_SERVER['REQUEST_METHOD']) == 'post' ) {
    curl_setopt( $ch, CURLOPT_POST, true );
    curl_setopt( $ch, CURLOPT_POSTFIELDS, $_POST );
  }

  if ( $_GET['send_cookies'] ) {
    $cookie = array();
    foreach ( $_COOKIE as $key => $value ) {
      $cookie[] = $key . '=' . $value;
    }
    if ( $_GET['send_session'] ) {
      $cookie[] = SID;
    }
    $cookie = implode( '; ', $cookie );

    curl_setopt( $ch, CURLOPT_COOKIE, $cookie );
  }

  curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
  curl_setopt( $ch, CURLOPT_HEADER, true );
  curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );

  curl_setopt( $ch, CURLOPT_USERAGENT, $_GET['user_agent'] ? $_GET['user_agent'] : $_SERVER['HTTP_USER_AGENT'] );

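  // With CURLOPT_FOLLOWLOCATION enabled the response can hold one header block
  // per redirect hop, so split at the total header size reported by cURL.
  $response    = curl_exec( $ch );
  $header_size = curl_getinfo( $ch, CURLINFO_HEADER_SIZE );
  $header      = (string) substr( $response, 0, $header_size );
  $contents    = (string) substr( $response, $header_size );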

  $status = curl_getinfo( $ch );

  curl_close( $ch );
}

// Split header text into an array.
$header_text = isset( $header ) ? preg_split( '/[\r\n]+/', $header ) : array();

if ( $_GET['mode'] == 'native' ) {
  if ( !$enable_native ) {
    $contents = 'ERROR: invalid mode';
    $status = array( 'http_code' => 'ERROR' );
  }

  // Propagate headers to response.
  foreach ( $header_text as $header ) {
    if ( preg_match( '/^(?:Content-Type|Content-Language|Set-Cookie):/i', $header ) ) {
      header( $header );
    }
  }

  print $contents;

} else {

  // $data will be serialized into JSON data.
  $data = array();

  // Propagate all HTTP headers into the JSON data object.
  if ( $_GET['full_headers'] ) {
    $data['headers'] = array();

    foreach ( $header_text as $header ) {
      preg_match( '/^(.+?):\s+(.*)$/', $header, $matches );
      if ( $matches ) {
        $data['headers'][ $matches[1] ] = $matches[2];
      }
    }
  }

  // Propagate all cURL request / response info to the JSON data object.
  if ( $_GET['full_status'] ) {
    $data['status'] = $status;
  } else {
    $data['status'] = array();
    $data['status']['http_code'] = $status['http_code'];
  }

  // Set the JSON data object contents, decoding it from JSON if possible.
  $decoded_json = json_decode( $contents );
  $data['contents'] = $decoded_json ? $decoded_json : $contents;

  // Generate appropriate content-type header.
  $is_xhr = strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) == 'xmlhttprequest';
  header( 'Content-type: application/' . ( $is_xhr ? 'json' : 'x-javascript' ) );

  // Get JSONP callback.
  $jsonp_callback = $enable_jsonp && isset($_GET['callback']) ? $_GET['callback'] : null;

  // Generate JSON/JSONP string
  $json = json_encode( $data );

  print $jsonp_callback ? "$jsonp_callback($json)" : $json;

}
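Because the script enables both CURLOPT_FOLLOWLOCATION and CURLOPT_HEADER, a redirected request returns one header block per hop in front of the body. The following sketch separates the final header block from the body when working on such a raw string. split_final_response() is a hypothetical helper and the sample response is invented for illustration; it assumes CRLF line endings and would misread a body that itself begins with "HTTP/".

```php
<?php
// Hypothetical helper: given a raw cURL response that may contain several
// header blocks (one per redirect hop), return array(final headers, body).
function split_final_response($raw)
{
    $headers = '';
    $body    = (string) $raw;

    // Peel off header blocks while the remainder still starts with a status line.
    while (strncmp($body, 'HTTP/', 5) === 0) {
        $pos = strpos($body, "\r\n\r\n");
        if ($pos === false) {
            // Headers only, no body followed.
            $headers = $body;
            $body    = '';
            break;
        }
        $headers = substr($body, 0, $pos);
        $body    = (string) substr($body, $pos + 4);
    }

    return array($headers, $body);
}

// Assumed sample: a 302 hop followed by the final 200 response.
$raw = "HTTP/1.1 302 Found\r\nLocation: http://www.google.co.uk/\r\n\r\n"
     . "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
     . "<html></html>";

list($finalHeaders, $body) = split_final_response($raw);
```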

Make sure you issue requests in this form: http://example.com/script?url=http://whateverurl.com/
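If the target URL carries its own query string, it is safer to URL-encode it before placing it in the url parameter. A small sketch, assuming a hypothetical helper named build_proxy_request_url() and the example addresses from above:

```php
<?php
// Hypothetical helper: builds the request URL for the proxy script,
// URL-encoding the target so its own "?" and "&" cannot leak into
// the proxy's query string.
function build_proxy_request_url($proxyScript, $targetUrl)
{
    return $proxyScript . '?' . http_build_query(array('url' => $targetUrl));
}

$request = build_proxy_request_url('http://example.com/script', 'http://whateverurl.com/');
// $request is now:
// http://example.com/script?url=http%3A%2F%2Fwhateverurl.com%2F
```

PHP decodes the parameter again when it fills $_GET['url'], so the proxy script receives the original target URL.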

Oh, and this PHP script returns its result as JSON. From there, you can parse it with jQuery.
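The JSON can equally be consumed from PHP. Judging by the script above, the object has a status.http_code field and a contents field; the sketch below reads them with json_decode(). extract_proxy_contents() is a hypothetical helper and the sample JSON string is made up for illustration.

```php
<?php
// Hypothetical helper: returns the "contents" field of the proxy's JSON
// output when the reported HTTP status is 200, otherwise null.
function extract_proxy_contents($json)
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['status']['http_code'])) {
        return null;
    }
    // The script reports 'ERROR' instead of a number on failure;
    // casting that to int gives 0, which is rejected here.
    if ((int) $data['status']['http_code'] !== 200) {
        return null;
    }
    return isset($data['contents']) ? $data['contents'] : null;
}

// Assumed sample output of the proxy script.
$sample = '{"status":{"http_code":200},"contents":"<p>hi</p>"}';
$html   = extract_proxy_contents($sample);
```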

For example, I use this jQuery code:

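<script type="text/javascript">
$(document).ready(function () {
    // URL of the PHP proxy script, including its ?url=... query string.
    var url = '+++++URL WHICH THE PHP PROXY SCRIPT IS IN++++++';

    // Request directly once the DOM is ready; a window load handler
    // registered inside ready() may never fire.
    $.getJSON(url, function (json) {
        // json.contents holds the body of the fetched page.
        $("#resu").append(json.contents);
    });
});
</script>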

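Edit: This script is not a real proxy in the sense of faking the IP address; requests still go out from the IP address of the server running it. Sorry for the confusion.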