Downloading the latest file on a server with wget

Time: 2014-05-01 20:29:47

Tags: linux bash unix ubuntu wget

Good Afternoon All,

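I'm trying to figure out how to use wget on a Linux system to download the latest file from a server. The files are 5-minute radar data, so the file names advance in 5-minute steps up to the most recent one, i.e. 1930.grib2, 1935.grib2, 1940.grib2, and so on.

Currently I have the following code in my bash script. It downloads every file starting from the top of the hour, which is not an efficient way to get only the latest file: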

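# YMD, HOMEDIR and PATH1 are assumed to be set earlier in the script (not shown).
HR=$(date +%H)
padtowidth=2
START=0
END=55
i=${START}

# Try every 5-minute slot of the current hour: 00, 05, ..., 55.
while [[ ${i} -le ${END} ]]
do
    # Zero-pad the minute to two digits.
    tau=$(printf "%0*d" "${padtowidth}" "${i}")

    URL1="http://thredds.ucar.edu/thredds/fileServer/grib/nexrad/composite/unidata/files/${YMD}/Level_3_Composite_N0R_${YMD}_${HR}${tau}.grib2"

    # -N only re-downloads a file if the server copy is newer than the local one.
    wget -P "${HOMEDIR}${PATH1}${YMD}/${HR}Z/" -N "${URL1}"

    ((i = i + 5))
done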

2 answers:

Answer 0 (score: 2)

If you can, download an index of all the files first, then parse it to find the latest file.

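If that is not possible, you could count backwards from the current time (using date +%M in addition to date +%H) and stop as soon as wget manages to fetch a file (e.g. when wget exits with 0).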
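The counting-backwards idea could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper names slot_stamp and fetch_latest are made up here, the file names are assumed to use UTC times, GNU date is assumed for the -d @EPOCH form, and the search gives up after 12 slots (one hour). The URL and file name pattern are copied from the question.

```shell
# Hypothetical helper: print the 5-minute slot that lies $2 slots before
# epoch time $1, formatted as YYYYMMDD_HHMM in UTC (assumes GNU date).
# The ( ) body runs in a subshell, so its variables do not leak out.
slot_stamp() (
    slot=$(( $1 - $1 % 300 - $2 * 300 ))   # round down to 5 min, then step back
    date -u -d "@${slot}" +%Y%m%d_%H%M
)

# Hypothetical helper: try the newest slot first and walk backwards until
# wget exits with 0, i.e. until a file was actually fetched.
fetch_latest() (
    base=http://thredds.ucar.edu/thredds/fileServer/grib/nexrad/composite/unidata/files
    now=$(date +%s)
    steps=0
    while [ "$steps" -lt 12 ]; do          # assumption: give up after one hour
        stamp=$(slot_stamp "$now" "$steps")
        ymd=${stamp%_*}                    # date part, used as the directory name
        if wget -q -N "${base}/${ymd}/Level_3_Composite_N0R_${stamp}.grib2"; then
            echo "fetched ${stamp}"
            return 0
        fi
        steps=$(( steps + 1 ))
    done
    return 1
)
```

Calling fetch_latest from the script then downloads at most one file per run. Working on epoch seconds rather than on the hour and minute strings means the step from 00:00 back to 23:55 of the previous day needs no special handling.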

Hope it helps!


Example of parsing the index:

filename=`wget -q -O - http://thredds.ucar.edu/thredds/catalog/grib/nexrad/composite/unidata/NEXRAD_Unidata_Reflectivity-20140501/files/catalog.html | grep '<a href=' | head -1 | sed -e 's/.*\(Level3_Composite_N0R_[0-9]*_[0-9]*.grib2\).*/\1/'`

This fetches the page and runs the first line containing <a href= through a quick sed to extract the filename.
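A possible variation, in case the first link on the page turns out not to be the newest file: collect every matching file name and take the one that sorts last. This is a sketch, not something tested against the live server. The helper name latest_from_index is made up, grep -o is assumed to be available (it is in GNU grep), and because the question writes Level_3 while the command above writes Level3, the pattern accepts both spellings.

```shell
# Hypothetical helper: read the catalog HTML on stdin and print the newest
# matching file name. The timestamps are zero-padded (YYYYMMDD_HHMM), so
# plain lexical sorting puts the most recent file last.
latest_from_index() {
    grep -o 'Level_\{0,1\}3_Composite_N0R_[0-9]\{8\}_[0-9]\{4\}\.grib2' |
        sort -u |
        tail -n 1
}

# Possible usage (CATALOG_URL is a placeholder for the catalog.html address):
#   filename=$(wget -q -O - "$CATALOG_URL" | latest_from_index)
```

Because the helper only reads stdin, it can be tried on a saved copy of catalog.html before pointing it at the server.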

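Answer 1 (score: 0)

I wrote a C++ console program to do this automatically, and I'll post the whole code below. Just use wget to fetch the catalog file, then run the program in the same directory. It will automatically create a BAT file that you can launch whenever you like to download the latest file. I wrote it specifically for the Unidata THREDDS server, so I know this is a good answer. Edit and important note: this is for the latest GOES-16 data, so you will have to work out the substring values for other products.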

#include <iostream>
#include <string>
#include <fstream>
#include <sstream>
using namespace std;


int main() 

{

// First, I open the catalog.html which was downloaded using wget, and put the entire file into a string.

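ifstream inFile; // create instance
inFile.open("catalog.html"); // opens the file
if (!inFile) { // stop early if catalog.html is not in the current directory
cerr << "Could not open catalog.html" << endl;
return 1;
}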
stringstream strStream; // create stringstream
strStream << inFile.rdbuf();  //read the file
string str = strStream.str();  //str holds the content of the file

cout << str << endl;  // The string contains the entire catalog ... you can do anything with the string

// Now I will create the entire URL we need automatically by getting the base URL which is known (step 1 is : string "first")

string first= "http://thredds-test.unidata.ucar.edu/thredds/fileServer/satellite/goes16/GRB16/ABI/CONUS/Channel02/current/";

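// The string "second" is the actual filename, since (to my knowledge) the filename in the HTML file never changes, but this must be watched in case it DOES change in the future. I use the C++ substr function to extract it.

if (str.size() < 252784 + 76) { // otherwise substr would throw std::out_of_range or return a truncated name
cerr << "catalog.html is shorter than expected" << endl;
return 1;
}
string second = str.substr(252784,76);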


// I then create a batch file and write "wget (base url + filename)" which can now automatically launch/download the latest GRIB2 file.

ofstream myfile2;
myfile2.open ("downloadGOESLatest.bat");
myfile2 << "wget ";
myfile2 << first;
myfile2 << second;
myfile2.close();


return 0;

}