通过PHP从ieeexplore.ieee.org下载PDF

时间:2012-07-09 22:23:11

标签: php cookies curl

我试图通过PHP从ieeexplore下载pdf文件,但似乎效果不佳。假设网址为http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5534992。我写了以下PHP代码:

function get_web_page($url) {
    $ch  = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_COOKIESESSION, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookie.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookie.txt');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);        
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $page = curl_exec($ch);
    curl_close($ch);
    return $page;
}

但此代码失败,没有下载任何内容。我检查了以下收到的http标头:

HTTP/1.0 200
Connection established
HTTP/1.1 302 Moved Temporarily
Server: Sun-ONE-Web-Server/6.1
Date: Mon, 09 Jul 2012 22:11:50 GMT
Content-length: 0
Content-type: text/html
Set-Cookie: ERIGHTS=na2vLnqZwz9xxRfO2zN8Ny66f0vHi85YE*ynGx2BtGx2FmIHkiEyx2Bg89Db6Qx3Dx3D-18x2dHeJj2k3B7UHsoix2BefrHXeAx3Dx3Dusln2oQUqj3KXiQXjOYx2BMwx3Dx3D-UQmTydx2FMwnGJOyKUw5iVDAx3Dx3D-eV0zE6ztXYKrVZluJrMMbAx3Dx3D;path=/;domain=.ieee.org
Location: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5534992&tag=1
Set-Cookie: WLSESSION=874668684.20480.0000; expires=Tue, 10-Jul-2012 22:11:48 GMT; path=/

HTTP/1.1 200 OK
Server: Sun-ONE-Web-Server/6.1
Date: Mon, 09 Jul 2012 22:11:50 GMT
Content-length: 203
Content-type: text/html; charset=UTF-8
Cache-Control: private
Product: 254
Inst: 9690
Licenseowner: 9690
Member: 0
Cache-Control: no-cache
Pragma: No-cache
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: xploreCookies={"standardsLicenseId":"0","openUrl":"http://linkserv.lib.utk.edu:9003/sfx","enterpriseLicenseId":"0","isIp":"true","desktopReportingUrl":"null","openUrlImgLoc":"http://www.lib.utk.edu/eresources/sfx2.gif","products":"IEL|VDE|","contactName":"NA","isChargebackUser":"false","contactEmail":"NA","oldSessionKey":"na2vLnqZwz9xxRfO2zN8Ny66f0vHi85YE*ynGx2BtGx2FmIHkiEyx2Bg89Db6Qx3Dx3D-18x2dHeJj2k3B7UHsoix2BefrHXeAx3Dx3Dusln2oQUqj3KXiQXjOYx2BMwx3Dx3D-UQmTydx2FMwnGJOyKUw5iVDAx3Dx3D-eV0zE6ztXYKrVZluJrMMbAx3Dx3D","userIds":"9690","instImage":"","isInst":"true","isDelegatedAdmin":"false","isMember":"false","instName":"UNIVERSITY OF TENNESSEE","customerSurvey":"NA","smallBusinessLicenseId":"0","openUrlTxt":"NA"}; domain=.ieee.org; path=/
Set-Cookie: JSESSIONID=V6pLP7XH4nvtQYcvmVc1ry1Y51vDHhkG8SGn9y0LG8XJv3k3hmJs!-1711984930; path=/; HttpOnly
X-Powered-By: Servlet/2.5 JSP/2.1

An error has occurred while trying to load your document. Please try again. If you continue to experience issues, please contact Customer Service.2016

However, if you paste the URL in the web browser, you may access the PDF file directly.

由于我在大学的领域,我不需要担心此PDF文件的访问许可。

有人有想法吗?

感谢〜

1 个答案:

答案 0 :(得分:1)

测试此代码                                                                                这里过去的代理ip(如果你没有填写它,将不会使用代理)                 值

        </tr>
        <tr>
            <td> For connceting to Database paste Its Url</td>
            <td><input type="text" name="ssurl" value="http://www.sciencedirect.com/science/article/pii/S0301421504000928" /></td> /></td>

        </tr>
        <tr>
            <td>Proxy Ip</td>
            <td><input type="text" name="ssproxyip" value="202.202.0.163"/></td>

        </tr>
        <tr>
            <td>Proxyport</td>
            <td><input type="text" name="ssproxyport" value="3128"/></td>

        </tr>
        <tr>
            <td>Proxy Username & password   (username:password)</td>
            <td><input type="text" name="ssproxyusernamepassword"/></td>

        </tr>
    </table>
    <input type="submit" name="ssurlsubmit" value="submit" />
    </form>



</body>
</html>

<?php

/**
 * @author nnnnn
 * @copyright 2012
 */
//removes string from the end of other
if (isset($_POST['ssurlsubmit'])) {
function removeFromEnd($string, $stringToRemove) {
     $stringToRemoveLen = strlen($stringToRemove);
     $stringLen = strlen($string);

     $pos = $stringLen - $stringToRemoveLen;    

     $out = substr($string, 0, $pos);

     return $out;
 }

//$string = 'picture.jpg.jpg';
//$string = removeFromEnd($string, '.jpg');
//$url='http://127.0.0.1/leech/';
//$url='http://www.sciencedirect.com/science/article/pii/S0301421504000928';


global $ssurl,$file_n;
$url=$_POST['ssurl'];
$file_n=$_POST['file_name'];
echo "URL:".$url.'<br />';


//$url = 'http://pdn.sciencedirect.com/science?_ob=MiamiImageURL&_cid=271097&_user=2501846&_pii=S0301421504000928&_check=y&_origin=article&_zone=toolbar&_coverDate=2005--31&view=c&originContentFamily=serial&wchp=dGLbVlB-zSkWb&md5=f51bc09e08b4d3eafb759ef5c08724c4&pid=1-s2.0-S0301421504000928-main.pdf';

 if (isset($_POST['ssproxyip'])) {
$proxyip=$_POST['ssproxyip'];
$proxyoprt=$_POST['ssproxyport'];


}
if (isset($_POST['proxyuserpassword'])) {
$proxyuserpassword=$_POST['ssproxyuserpassword'];

}

$mypath = getcwd();
            $mypath = preg_replace('/\\\\/', '/', $mypath);
            $rand = rand(1, 15000);

            if (!file_exists("$mypath/cookies") and !is_dir("$mypath/cookies")) {
mkdir("$mypath/cookies");
} 

            $cookie_file_path = "$mypath/cookies/cookie$rand.txt";
            echo $cookie_file_path.'<br />';
            echo 'cookie2:  '.$cookie_file_path.'<br/>';
if (! file_exists($cookie_file_path) || ! is_writable($cookie_file_path))
{

 //$fp1 = fopen($cookie_file_path, "w");
  //fclose($fp1);
  }
  if (! file_exists($cookie_file_path) || ! is_writable($cookie_file_path))
  {
    echo 'Cookie file missing or not writable.';
    //exit;
}
if ( ! extension_loaded('curl'))
{
    echo "You need to load/activate the curl extension.";
}


 $ss=substr($url,-4);
 $string = removeFromEnd($url, '.pdf');
 echo "ss:  ".$ss.'<br />'; 
 //$url = 'http://www.sciencedirect.com/science/jrnlallbooks/a/fulltext';
//$proxy = '200.93.148.72:3128';
$ext = substr($fileName, strrpos($fileName, '.') + 1);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
//curl_setopt($ch, CURLOPT_PROXY, "203.64.181.50"); //your proxy url
//curl_setopt($ch, CURLOPT_PROXYPORT, "3128"); // your proxy port number 
//curl_setopt($ch, CURLOPT_PROXYUSERPWD, "bjm:12345"); //username:pass 
//curl_setopt($ch, CURLOPT_PROXY, "202.202.0.163"); //your proxy url
//curl_setopt($ch, CURLOPT_PROXYPORT, "3128"); // your proxy port number 

if (isset($_POST['ssproxyip'])) {
curl_setopt($ch, CURLOPT_PROXY, $proxyip); //your proxy url
curl_setopt($ch, CURLOPT_PROXYPORT, $proxyport); // your proxy port number 

}
if (isset($_POST['proxyuserpassword'])) {
curl_setopt($ch, CURLOPT_PROXYUSERPWD,$proxyuserpassword); //username:pass 
}
curl_setopt($ch, CURLOPT_TIMEOUT, 0);
//curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);


curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file_path);
curl_setopt($ch, CURLOPT_COOKIESESSION, true);



curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; Media Center PC 4.0)');
//curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
//curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
//curl_setopt($curl, CURLOPT_VERBOSE, 1);
/*
*/
//echo $string;
echo substr($url,-4).'<br />';
//echo $url;   
$proxy = $proxyip.':'.$proxyoprt;
echo 'proxyip:   '.$proxyip.'<br />';
echo 'proxy:  '.$proxy.'<br />';
$timeout = 5;
$splited = explode(':',$proxy); // Separate IP and port
echo 'splited:   '.$splited.'<br />';
echo $splited[0].'<br />';
echo $splited[1].'<br />';
/*
//if($con = @fsockopen($splited[0], $splited[1], $errorNumber, $errorMessage, $timeout)) 
if($con = @fsockopen($proxyip, $proxyoprt, $errorNumber, $errorMessage, $timeout)) 
{
    echo 'Connection successful, PROXY works!'.'<br />';
} else {
 echo 'Connection FAILED, PROXY FAIL!'.'<br />';
    echo $errorNumber .'<br />';
    echo ' ' . $errorMessage.'<br />';
}
*/
echo "if not run".'<br />';
echo $ss.'<br />';
if ($ss=='.zip' || $ss=='r.gz' ||  $ss=='.pdf') 
 { 
 echo "if runed".'<br />';
 ini_set('max_allowed_packet', '164M');
 ini_set('mysql.wait_timeout', 600);


 ini_set('max_execution_time', '200');

 ini_set('mysql.reconnect', 'On');
  ini_set('mysql.connect_timeout', 300);
ini_set('default_socket_timeout', 300);

 $file = basename($url);    
//$file_extection=extension($url);

echo "file:  ".$file.'<br />';
echo "url:   ".$url.'<br />' ;
$dir= $file;

//    $url  = 'http://www.example.com/a-large-file.zip';
    //$path = $_SERVER['DOCUMENT_ROOT'] . '/downloads/'.$file;
    $path = $_SERVER[$url] . $file;
     if (isset($file_n )&& strlen($file_n)>0) {
    $path=$file_n;
    }
    echo "path:   ".$path.'<br />' ;
        $fp = fopen($path, 'w');
//$fp = fopen(basename($url).'zip', 'w+');
/**
* Ask cURL to write the contents to a file
*/
curl_setopt($ch, CURLOPT_FILE, $fp);
$curl_scraped_page = curl_exec($ch);
$file = 'file.pdf';
$fileName = 'fileName.pdf';
file_put_contents($file, $curl_scraped_page);
file_put_contents($path, $curl_scraped_page);

fclose($fp);

header('Content-type: application/pdf');
header('Content-Disposition: inline; filename="' . $filename . '"');
header('Content-Transfer-Encoding: binary');
header('Content-Length: ' . filesize($file));
header('Accept-Ranges: bytes');

readfile($file);

echo "File DONE".'<br />';
}else {
    echo "curl_scraped_page   ".$curl_scraped_page.'<br />' ;


$curl_scraped_page = curl_exec($ch);
$file = 'file.pdf';
$fileName = 'fileName.pdf';
file_put_contents($file, $curl_scraped_page);
file_put_contents($path, $curl_scraped_page);


header('Content-type: application/pdf');
header('Content-Disposition: inline; filename="' . $filename . '"');
header('Content-Transfer-Encoding: binary');
header('Content-Length: ' . filesize($file));
header('Accept-Ranges: bytes');
}
/*
$file = $url; // URL to the file

$contents = file_get_contents($file); // read the remote file

touch('somelocal.pdf'); // create a local EMPTY copy

file_put_contents('somelocal.pdf', $contents); // put the fetchted data into the newly created file
*/
curl_close($ch);
echo 'curl_close'.'<br />';
echo $curl_scraped_page.'<br />';
}

?>

您可以在此链接进行测试: http://apr9ss.tk/curl-proxy-down-work2.php