Getting the headers of a large file URL with PHP

Asked: 2015-04-15 16:06:23

Tags: php http curl

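I'm using PHP with a database to pull data from one of our sites into another. Part of this involves moving files whenever I find them referenced in the HTML.

One aspect of this is checking that the file exists and that it is not HTML (meaning an actual file is located there).

Using get_headers takes a very long time for a 2.2MB PDF. I tried doing the same thing with the following cURL request: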

    public function getHeaders( $url ){
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_URL, $url );
        //curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
        //curl_setopt( $ch, CURLOPT_VERBOSE, 0 );
        //curl_setopt( $ch, CURLOPT_HEADER, 1 );
        curl_setopt( $ch, CURLOPT_CUSTOMREQUEST, 'HEAD' );
        curl_exec( $ch );
        $info = curl_getinfo( $ch );
        curl_close( $ch );
        return $info;
    }

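The problem here is that it simply takes a long time (~20+ seconds) to get the headers back. Once I know the URL is a file and returns 200, I will go back, download it, and insert it into my new database.

Any ideas on how to get the headers better and faster? Thanks.

====== Edit 10:30a CDT 4/20/2015 ======

Sample code implementing the suggested methods: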

<?php

//$file = 'http://www.pmi.org/Certification/~/media/PDF/Certifications/pdc_pmphandbook.ashx';
$file = 'https://www.projectmanagement-training.net/download/book_project_management.pdf';

print( 'Starting CURL Method : ' );
$time_start = microtime( true ); 
$headers = getHeaders( $file );
$execution_time = round( ( microtime( true ) - $time_start )/60, 8 );
print ( $execution_time . ' seconds <br />' );
print( '<pre>' . print_r( $headers, true ) . '</pre>' );



print( 'Starting get_headers() Method : ' );
$time_start = microtime( true ); 
$headers = get_headers( $file );
$execution_time = round( ( microtime( true ) - $time_start )/60, 8 );
print ( $execution_time . ' seconds <br />' );
print( '<pre>' . print_r( $headers, true ) . '</pre>' );



print( 'Starting get_headers() with context type Method : ' );
$time_start = microtime( true ); 
stream_context_set_default( array( 'http' => array( 'method' => 'HEAD', 'ignore_errors' => true ) ) );
$headers = get_headers( $file );
$execution_time = round( ( microtime( true ) - $time_start )/60, 8 );
print ( $execution_time . ' seconds <br />' );
print( '<pre>' . print_r( $headers, true ) . '</pre>' );



print( 'Starting file_get_contents Method : ' );
$time_start = microtime( true ); 
$context = stream_context_create( array( 'http' => array( 'method' => 'HEAD', 'ignore_errors' => true ) ) );
$file = file_get_contents( $file, false, $context );
$execution_time = round( ( microtime( true ) - $time_start )/60, 8 );
print ( $execution_time . ' seconds <br />' );
print( '<pre>' . print_r( $http_response_header, true ) . '</pre>' );

function getHeaders( $url ){
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_URL, $url );
    //curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
    //curl_setopt( $ch, CURLOPT_VERBOSE, 0 );
    //curl_setopt( $ch, CURLOPT_HEADER, 1 );
    curl_setopt( $ch, CURLOPT_CUSTOMREQUEST, 'HEAD' );
    curl_exec( $ch );
    $info = curl_getinfo( $ch );
    curl_close( $ch );
    return $info;
}

?>

Output:

Starting CURL Method : 0.01373608 seconds 
Array
(
    [url] => https://www.projectmanagement-training.net/download/book_project_management.pdf
    [content_type] => 
    [http_code] => 0
    [header_size] => 0
    [request_size] => 0
    [filetime] => -1
    [ssl_verify_result] => 1
    [redirect_count] => 0
    [total_time] => 0.202
    [namelookup_time] => 0
    [connect_time] => 0.124
    [pretransfer_time] => 0
    [size_upload] => 0
    [size_download] => 0
    [speed_download] => 0
    [speed_upload] => 0
    [download_content_length] => -1
    [upload_content_length] => -1
    [starttransfer_time] => 0
    [redirect_time] => 0
    [redirect_url] => 
    [primary_ip] => 81.169.145.64
    [certinfo] => Array
        (
        )

    [primary_port] => 443
    [local_ip] => 127.0.0.1
    [local_port] => 62741
)
Starting get_headers() Method : 0.03559045 seconds 
Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Mon, 20 Apr 2015 15:28:28 GMT
    [2] => Server: Apache/2.2.29 (Unix)
    [3] => X-Powered-By: PHP/5.3.29
    [4] => Content-Disposition: attachment; filename="book_project_management.pdf"
    [5] => Content-Type: application/pdf
    [6] => Connection: close
)
Starting get_headers() with context type Method : 0.03277322 seconds 
Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Mon, 20 Apr 2015 15:28:30 GMT
    [2] => Server: Apache/2.2.29 (Unix)
    [3] => X-Powered-By: PHP/5.3.29
    [4] => Content-Disposition: attachment; filename="book_project_management.pdf"
    [5] => Content-Type: application/pdf
    [6] => Connection: close
)
Starting file_get_contents Method : 0.04345868 seconds 
Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Mon, 20 Apr 2015 15:28:33 GMT
    [2] => Server: Apache/2.2.29 (Unix)
    [3] => X-Powered-By: PHP/5.3.29
    [4] => Content-Disposition: attachment; filename="book_project_management.pdf"
    [5] => Content-Type: application/pdf
    [6] => Connection: close
)

3 Answers:

Answer 0: (score: 1)

If your goal is only to get the headers, why not use PHP's built-in function? :)

http://php.net/manual/en/function.get-headers.php
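
As a rough illustration of the check the question describes (the file exists and is not HTML), the array returned by get_headers() could be inspected along the following lines. The helper name looksLikeDownloadableFile is hypothetical, and the sample array is trimmed from the output shown in the question rather than fetched live.

```php
<?php
/**
 * Hypothetical helper: decide from a get_headers()-style array whether the
 * final response was a 200 with a non-HTML Content-Type.
 */
function looksLikeDownloadableFile(array $headers): bool
{
    $status = 0;
    $contentType = '';
    foreach ($headers as $line) {
        // A status line starts a new response (redirects produce several).
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
            $contentType = '';
        } elseif (stripos($line, 'Content-Type:') === 0) {
            $contentType = strtolower(trim(substr($line, strlen('Content-Type:'))));
        }
    }
    // Assumption: a response without any Content-Type is treated as "not a file".
    return $status === 200
        && $contentType !== ''
        && strpos($contentType, 'text/html') !== 0;
}

// Sample taken from the question's output; a live call would be
// $headers = get_headers($url);
$sample = [
    'HTTP/1.1 200 OK',
    'Content-Disposition: attachment; filename="book_project_management.pdf"',
    'Content-Type: application/pdf',
    'Connection: close',
];
var_dump(looksLikeDownloadableFile($sample));
```

Only the last status line is considered, so a redirect chain ending in a 200 still counts as a hit.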

Answer 1: (score: 0)

file_get_contents may be a quicker way, since its options let you return only the header information:

<?php
    $url = "http://static.adzerk.net/Advertisers/831a088cf67e42c580e407e2d91c8ce6.jpg";

    $options = [
          'http' => [
               'method' => "HEAD",
               'ignore_errors' => 1
                ]
    ];

    $context = stream_context_create($options);
    $file = file_get_contents($url, false, $context);
    print_r($http_response_header);
?>

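Although, as mentioned above, PHP's stock function http://php.net/manual/en/function.get-headers.php might do the trick :)

Answer 2: (score: 0)

Check these times in the $info array. They will tell you where the time is being spent: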

CURLINFO_NAMELOOKUP_TIME
CURLINFO_CONNECT_TIME
CURLINFO_PRETRANSFER_TIME
CURLINFO_STARTTRANSFER_TIME
CURLINFO_SPEED_DOWNLOAD
CURLINFO_TOTAL_TIME
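
To make those timers easier to read, here is a minimal sketch that converts them into per-phase durations. It assumes the timers are cumulative seconds since the start of the request, which is how curl reports them; the helper name timingBreakdown and the phase labels are hypothetical. The sample values are the ones from the question's output, where the transfer evidently stopped early (http_code 0), so some phases come out as zero.

```php
<?php
/**
 * Hypothetical helper: turn curl_getinfo()'s cumulative timers into
 * per-phase durations in seconds. Phases that were never reached are 0.
 */
function timingBreakdown(array $info): array
{
    // Each timer is the time elapsed since the request started.
    $markers = [
        'dns'                 => 'namelookup_time',
        'connect'             => 'connect_time',
        'handshake_and_setup' => 'pretransfer_time',
        'server_wait'         => 'starttransfer_time',
        'transfer'            => 'total_time',
    ];
    $phases = [];
    $previous = 0.0;
    foreach ($markers as $phase => $key) {
        $value = (float) ($info[$key] ?? 0);
        if ($value < $previous) {
            // Timer was never set (request failed before this phase).
            $phases[$phase] = 0.0;
            continue;
        }
        $phases[$phase] = round($value - $previous, 6);
        $previous = $value;
    }
    return $phases;
}

// Values copied from the curl_getinfo() dump in the question.
print_r(timingBreakdown([
    'namelookup_time'    => 0,
    'connect_time'       => 0.124,
    'pretransfer_time'   => 0,
    'starttransfer_time' => 0,
    'total_time'         => 0.202,
]));
```

A large server_wait or transfer value on a HEAD request would suggest the client is waiting on the server or on a body, rather than on DNS or the connection.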

Test the links with these two sites:

http://www.webpagetest.org/
http://gtmetrix.com/


If you use get_headers(), set its default context with stream_context_set_default().

get_headers() uses the default stream context set by stream_context_set_default(), so this is a valid option.

    stream_context_set_default(
        array(
            'http' => array(
                'method' => 'HEAD'
            )
        )
    );
    $headers = get_headers('http://example.com');
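
On PHP 7.1 and later, get_headers() also accepts a stream context as its third argument, which avoids changing the global default for every other stream call in the script. The sketch below builds such a context; the helper name headContext and the timeout values are illustrative assumptions, not part of the original answer.

```php
<?php
/**
 * Hypothetical helper: build a HEAD-only stream context with a read timeout.
 * The 10-second default is an arbitrary illustrative value.
 */
function headContext(float $timeoutSeconds = 10.0)
{
    return stream_context_create([
        'http' => [
            'method'        => 'HEAD',
            'timeout'       => $timeoutSeconds, // read timeout in seconds
            'ignore_errors' => true,            // fetch the response even on failure status codes
        ],
    ]);
}

// PHP 7.1+ usage (performs a network request, so it is left commented out):
// $headers = get_headers('http://example.com', false, headContext());

// Inspect the options without touching the network.
$options = stream_context_get_options(headContext(5.0));
print_r($options['http']);
```

Passing the context per call keeps the HEAD method from leaking into later file_get_contents() calls that are meant to download the file.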

RE: curl

You will not get the headers while this line is commented out:

//curl_setopt( $ch, CURLOPT_HEADER, 1 );

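Also, you are not capturing the returned data, which is where the response headers end up.

Set the following options:

curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // needed so curl_exec() returns the data
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);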

You need to add timeouts and enable fail-on-error:

curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
curl_setopt($ch, CURLOPT_FAILONERROR,true);
curl_setopt($ch, CURLOPT_ENCODING,"");



$data = curl_exec($ch);

if (curl_errno($ch)){
    $info['error'] = curl_error($ch);
}
else {
    // CURLINFO_HEADER_SIZE is the byte length of the response headers in $data.
    $skip = intval(curl_getinfo($ch, CURLINFO_HEADER_SIZE));
    $responseHeader = substr($data, 0, $skip);
    $info = curl_getinfo($ch);
    $info['responseHeader'] = $responseHeader;
}
return $info;
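
Putting the pieces together, a self-contained sketch might look like the following. It swaps CURLOPT_CUSTOMREQUEST for CURLOPT_NOBODY, which is the option libcurl documents for HEAD requests; with only the custom request verb set, curl may keep waiting for a body that never arrives, which is one plausible (unconfirmed) explanation for the ~20 second delay in the question. The function names parseHeaderBlock and fetchHeaders and the 10-second timeout are hypothetical.

```php
<?php
/**
 * Hypothetical helper: split a raw header block (as returned by curl with
 * CURLOPT_HEADER enabled) into a status code and a lower-cased header map.
 * Only the last response is kept when redirects produced several blocks.
 */
function parseHeaderBlock(string $raw): array
{
    $blocks = preg_split('/\r?\n\r?\n/', trim($raw));
    $lines  = preg_split('/\r?\n/', (string) end($blocks));
    $result = ['status' => 0, 'headers' => []];
    foreach ($lines as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $result['status'] = (int) $m[1];
        } elseif (strpos($line, ':') !== false) {
            [$name, $value] = explode(':', $line, 2);
            $result['headers'][strtolower(trim($name))] = trim($value);
        }
    }
    return $result;
}

/** Hypothetical wrapper: issue a HEAD request and return the parsed headers. */
function fetchHeaders(string $url, int $timeoutSeconds = 10): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // send HEAD and do not wait for a body
        CURLOPT_HEADER         => true, // include response headers in the output
        CURLOPT_RETURNTRANSFER => true, // return the output instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects to the final URL
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $raw = curl_exec($ch);
    if ($raw === false) {
        return ['status' => 0, 'headers' => [], 'error' => curl_error($ch)];
    }
    return parseHeaderBlock($raw);
}

// Example usage (performs a network request):
// $result = fetchHeaders('http://example.com');
// $isFile = $result['status'] === 200
//     && strpos($result['headers']['content-type'] ?? 'text/html', 'text/html') !== 0;
```

Keeping the parser separate from the network call makes the header logic easy to check against captured responses without contacting the server.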