PHP cURL: downloading a zipped CSV from a Caspio-powered website

Time: 2017-04-20 15:39:59

Tags: php curl web-scraping zip

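I need to download a zipped .csv file from this site: http://www.phrfsocal.org/web-lookup-2/ The file is behind the Download Data link above the table on the right. The problem is that the link is created dynamically, so I need to extract it first.

That part seems to work. I get this link from the href: https://b6.caspio.com/dp.asp?appSession=68982476236455965042483715808486764445346819370685922723164994812296661481433499615115137717633929851735433386281180144919150987&RecordID=&PageID=2&PrevPageID=&cpipage=&download=1

When I paste that link into a new browser tab, the browser downloads a zip file containing the csv I am interested in.

But when I use cURL to fetch the zip, I get the HTML of the table below the link instead. I can't figure out how to grab the .zip. My code is below; the first part finds the link and seems to work.

The second part is where I'm having trouble.

PS: I have permission from the owner of this page to download this data nightly using a cron job. Thanks in advance, Dave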

$url = "http://www.phrfsocal.org/web-lookup-2/";

// url to the dynamic content doesn't seem to change.
$url = "https://b6.caspio.com/dp.asp?AppKey=0dc330000cbc1d03fd244fea82b4";

$header = get_web_page($url);
// Find the location of the Download Data link and extract the href      
$strpos = strpos($header['content'], 'Download Data');
$link = substr($header['content'], $strpos, 300);
$link = explode(" ", $link);
$link = explode('"', $link[2]);
$url1 = $link[1];

print_r($url1);
print "<p>";

// Now Go get the zip file.
$zipFile = "temp/SoCalzipfile.zip"; // Local Zip File Path
$zipResource = fopen($zipFile, "w+");
// Get The Zip File From Server
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url1);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_FILE, $zipResource);
$page = curl_exec($ch);
if (!$page) {
    echo "Error :- " . curl_error($ch);
}
curl_close($ch);

echo "zip file received";
/* Open the Zip file */
$zip = new ZipArchive;
$extractPath = "temp";
if ($zip->open($zipFile) !== true) {
    echo "Error :- Unable to open the Zip File";
}
/* Extract Zip File */
$zip->extractTo($extractPath);
$zip->close();
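
The link extraction in the first part of the code above relies on fixed offsets into the HTML. As a hedged alternative, here is a minimal sketch that uses PHP's DOMDocument to find the anchor by its text. The helper name find_link_href and the sample markup are made up for illustration, and the real Caspio page may be structured differently. One side effect worth knowing: DOMDocument decodes HTML entities such as &amp; in attribute values, which plain string slicing does not.

```php
<?php
// Hypothetical helper: returns the href of the first <a> whose text
// contains $linkText, or null if no such link exists.
function find_link_href(string $html, string $linkText): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if (strpos($anchor->textContent, $linkText) !== false) {
            return $anchor->getAttribute('href');
        }
    }
    return null;
}

// Made-up sample markup, NOT the real Caspio page.
$sample = '<div><a href="https://example.com/dp.asp?PageID=2&amp;download=1">Download Data</a></div>';
$href = find_link_href($sample, 'Download Data');
```

With the sample markup, $href holds the URL with a plain & between the parameters, ready to pass to CURLOPT_URL.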

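1 Answer:

Answer 0 (score: 2)

The code below downloads the zip file and extracts it into the specified folder. Make sure that folder is writable; in this example, the temp folder needs write permission.

You also don't need to fetch the HTML version of the page to extract the link. I played with the URL a bit, and you can get the zip file for each page using the cpipage variable. Change the $page_num variable to fetch the zip for a given page.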
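
As a side note on the URL pattern just described, the query string could also be assembled with http_build_query instead of string concatenation. This is only a sketch: the helper name caspio_download_url is hypothetical, and the parameter names and values are copied from the URL used in this answer rather than from any Caspio documentation.

```php
<?php
// Hypothetical helper: builds the download URL for one page of results.
// Parameter names are taken from the URL shown in this answer.
function caspio_download_url(string $appKey, int $pageNum): string
{
    $params = [
        'AppKey'     => $appKey,
        'RecordID'   => '',
        'PageID'     => 2,
        'PrevPageID' => '',
        'cpipage'    => $pageNum,
        'download'   => 1,
    ];
    // Pass '&' explicitly so the result does not depend on the
    // arg_separator.output ini setting.
    return 'https://b6.caspio.com/dp.asp?' . http_build_query($params, '', '&');
}

$exampleUrl = caspio_download_url('0dc330000cbc1d03fd244fea82b4', 1);
```

Building the query from an array keeps each parameter visible on its own line, which makes it easier to loop over several page numbers.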

$page_num = 1;

$url = 'https://b6.caspio.com/dp.asp?AppKey=0dc330000cbc1d03fd244fea82b4&RecordID=&PageID=2&PrevPageID=&cpipage=' .$page_num. '&download=1';

$zipFile = "temp/SoCalzipfile.zip"; // Local Zip File Path
$zipResource = fopen($zipFile, "w");
// Get The Zip File From Server
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_FILE, $zipResource);
$page = curl_exec($ch);
if (!$page) {
    echo "Error :- " . curl_error($ch);
}
curl_close($ch);
// Close the file handle so all data is flushed to disk before ZipArchive reads it
fclose($zipResource);


$zip = new ZipArchive;
$extractPath = "temp";
// ZipArchive::open() returns true on success or an integer error code on failure
if ($zip->open($zipFile) !== true) {
    echo "Error :- Unable to open the Zip File";
} else {
    /* Extract Zip File */
    $zip->extractTo($extractPath);
    $zip->close();
}
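
Since the symptom in the question was an HTML page being saved where a zip was expected, a quick sanity check on the downloaded file before handing it to ZipArchive may help when this runs unattended from cron. The following is a minimal sketch, not part of the original answer. The helper name looks_like_zip is hypothetical, and the demonstration uses temporary files rather than the real download.

```php
<?php
// Hypothetical helper: true if the file starts with a ZIP signature.
function looks_like_zip(string $path): bool
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $signature = fread($handle, 4);
    fclose($handle);
    // "PK\x03\x04" starts a normal archive, "PK\x05\x06" an empty one.
    return $signature === "PK\x03\x04" || $signature === "PK\x05\x06";
}

// Self-contained demonstration using temporary files.
$htmlFile = tempnam(sys_get_temp_dir(), 'chk');
file_put_contents($htmlFile, '<html><body>table</body></html>');
$zipLike = tempnam(sys_get_temp_dir(), 'chk');
file_put_contents($zipLike, "PK\x03\x04" . str_repeat("\0", 26));

$htmlIsZip = looks_like_zip($htmlFile);
$zipIsZip = looks_like_zip($zipLike);

unlink($htmlFile);
unlink($zipLike);
```

In the download script, the same check could be run on $zipFile right after the cURL transfer finishes, so a saved HTML page is reported as such instead of surfacing later as an unreadable archive.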