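I'm trying to download files from a directory. The only difference between the file URLs is the date in the middle (https://eogdata.mines.edu/wwwdata/viirs_products/vnf/v30/VNF_j01_d20180607_noaa_v30-ez.csv.gz). I'd like to increment and iterate over the date so that I only have to supply one URL with a changing date, instead of giving the code more than 500 URLs. So far, I can only download a single file.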
import urllib.request

# URLopener is deprecated; urlretrieve performs the same single-file download.
urllib.request.urlretrieve(
    "https://eogdata.mines.edu/wwwdata/viirs_products/vnf/v30/VNF_j01_d20180607_noaa_v30-ez.csv.gz",
    "C:/users/user 1/Desktop/20180607.gz",
)
Answer 0 (score: 0)
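This seems like a promising approach (I'm no expert). It uses the re regular expression module to parse the lines of the request.urlopen() response, looking for double-quoted filenames that contain something that looks like a date and end with the characters '.gz':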
import re
from urllib import request
from urllib.error import HTTPError

MAXLINES = 20  # To limit number of lines read - set to zero to disable.
directory = 'https://eogdata.mines.edu/wwwdata/viirs_products/vnf/v30'

# Whitespace inside the pattern is ignored because of re.VERBOSE.
# Each alternation is grouped so that year, month, and day stay together.
pattern = re.compile(
    r""" "(\S*(\d{4} (?:0[1-9]|1[012]) (?:[012][0-9]|3[01]))\S*\.gz)" """,
    re.VERBOSE)

try:
    with request.urlopen(directory) as response:
        for i, line in enumerate(response, 1):
            match = pattern.search(line.decode('utf-8'))
            if match:
                print(match.group(1))  # Print matching filename.
            if MAXLINES and i >= MAXLINES:  # Stop early? (for testing)
                break
except HTTPError as e:
    print('Failed to open directory')
    print('Reason: ', e.reason)
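If the goal is only to step through dates, as the question asks, the URLs could also be built directly rather than scraped. Below is a minimal sketch of that idea, assuming every file follows the same naming scheme as the one URL in the question. The names BASE_URL, NAME_TEMPLATE, dated_urls and download_range are made up for this illustration, and days with no file on the server are simply skipped.

```python
import os
from datetime import date, timedelta
from urllib.error import HTTPError
from urllib.request import urlretrieve

# Assumption: all files share the pattern of the single URL in the question.
BASE_URL = "https://eogdata.mines.edu/wwwdata/viirs_products/vnf/v30/"
NAME_TEMPLATE = "VNF_j01_d{day:%Y%m%d}_noaa_v30-ez.csv.gz"


def dated_urls(start, end):
    """Yield (day, url) for every day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day, BASE_URL + NAME_TEMPLATE.format(day=day)
        day += timedelta(days=1)


def download_range(start, end, dest_dir):
    """Download one file per day into dest_dir, skipping days that fail."""
    for day, url in dated_urls(start, end):
        target = os.path.join(dest_dir, "{:%Y%m%d}.gz".format(day))
        try:
            urlretrieve(url, target)
        except HTTPError as e:
            # A missing day (e.g. 404) should not stop the whole run.
            print("Skipping", url, "-", e.reason)
```

For example, download_range(date(2018, 6, 7), date(2018, 6, 9), "C:/users/user 1/Desktop") would attempt three files. Using timedelta handles month and year boundaries, so no manual date arithmetic is needed.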