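I have an FTP URL for a directory that contains some files I'm interested in downloading:
ftp://lidar.wustl.edu/Phelps_Rolla/
I can list everything in that directory with the following: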
import urllib2
import BeautifulSoup
request = urllib2.Request("ftp://lidar.wustl.edu/Phelps_Rolla/")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
>>> soup
drwxrwxrwx 1 user group 0 Nov 7 2012 .
drwxrwxrwx 1 user group 0 Nov 7 2012 ..
drwxrwxrwx 1 user group 0 Nov 7 2012 ESRI_Grids
drwxrwxrwx 1 user group 0 Nov 7 2012 ESRI_Shapefiles
drwxrwxrwx 1 user group 0 Nov 7 2012 LAS_Files
-rw-rw-rw- 1 user group 545700 May 27 2011 LiDAR Accuracy Report_Rolla.pdf
drwxrwxrwx 1 user group 0 Nov 7 2012 Rolla Survey
-rw-rw-rw- 1 user group 4865 May 26 2011 Rolla_SEMA_Tile_Index.dbf
-rw-rw-rw- 1 user group 503 May 26 2011 Rolla_SEMA_Tile_Index.prj
-rw-rw-rw- 1 user group 188 May 26 2011 Rolla_SEMA_Tile_Index.sbn
-rw-rw-rw- 1 user group 124 May 26 2011 Rolla_SEMA_Tile_Index.sbx
-rw-rw-rw- 1 user group 1100 May 26 2011 Rolla_SEMA_Tile_Index.shp
-rw-rw-rw- 1 user group 12682 May 31 2011 Rolla_SEMA_Tile_Index.shp.xml
-rw-rw-rw- 1 user group 140 May 26 2011 Rolla_SEMA_Tile_Index.shx
How can I download only the files whose names contain "Tile" or "tile" and that have the extensions ".dbf", ".prj", ".shp", and ".shx"?
Answer 0: (score: 4)
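You are using urllib and BeautifulSoup, but when dealing with FTP, the dedicated standard-library module ftplib is probably the better choice. Head over to its documentation and read how to connect to an FTP server, log in, and list a directory; there is a simple walkthrough there.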
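As a rough illustration of the connect-and-list step, here is a minimal sketch. The helper names `list_names` and `connect_and_list` are made up for this example, and it assumes Python 3.3 or later (where `FTP` can be used in a `with` statement) and a server that accepts anonymous logins.

```python
from ftplib import FTP


def list_names(ftp, directory):
    """Change into `directory` on an open connection and return the names in it."""
    ftp.cwd(directory)
    return ftp.nlst()


def connect_and_list(host, directory):
    """Log in anonymously to `host` and list `directory` (needs network access)."""
    # FTP objects work as context managers in Python 3.3+; the connection
    # is closed when the block exits.
    with FTP(host) as ftp:
        ftp.login()  # no arguments means an anonymous login
        return list_names(ftp, directory)
```

Calling `connect_and_list("lidar.wustl.edu", "Phelps_Rolla")` should return the plain names rather than the long `ls`-style lines shown in the question, provided the server is still reachable.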
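The next step is to figure out how to filter your files, which is a matter of using a list comprehension to keep only the names that contain a given substring; see this question or this question. Finally, you need to look up how to download a file over FTP, and you will find this question. It turns out that a file download is done by calling ftp.retrbinary().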
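The filtering step can be tried out without any network access. The sketch below uses the file names from the listing in the question; the names `WANTED_EXTENSIONS` and `is_wanted` are only illustrative.

```python
WANTED_EXTENSIONS = (".dbf", ".prj", ".shp", ".shx")


def is_wanted(name):
    """Return True for names containing "tile" (any case) with a wanted extension."""
    lowered = name.lower()
    # str.endswith accepts a tuple and matches if any of the suffixes fits.
    return "tile" in lowered and lowered.endswith(WANTED_EXTENSIONS)


# File names copied from the directory listing in the question.
names = [
    "LiDAR Accuracy Report_Rolla.pdf",
    "Rolla_SEMA_Tile_Index.dbf",
    "Rolla_SEMA_Tile_Index.prj",
    "Rolla_SEMA_Tile_Index.sbn",
    "Rolla_SEMA_Tile_Index.sbx",
    "Rolla_SEMA_Tile_Index.shp",
    "Rolla_SEMA_Tile_Index.shp.xml",
    "Rolla_SEMA_Tile_Index.shx",
]

selected = [name for name in names if is_wanted(name)]
print(selected)
```

This keeps the `.dbf`, `.prj`, `.shp`, and `.shx` files and drops `Rolla_SEMA_Tile_Index.shp.xml`, because that name ends in `.xml`.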
Here is a simple script that does everything I mentioned above:
from ftplib import FTP

ftp = FTP("lidar.wustl.edu")
ftp.login()  # no arguments means an anonymous login
ftp.cwd("Phelps_Rolla")
# list the names in the current directory with ftplib
file_list = ftp.nlst()
for f in file_list:
    name = f.lower()
    # apply your filters: "tile" in the name plus one of the wanted extensions
    if "tile" in name and name.endswith((".dbf", ".prj", ".shp", ".shx")):
        # download the file by sending the "RETR <name of file>" command;
        # retrbinary calls out.write once for each block of binary data received
        with open(f, "wb") as out:
            ftp.retrbinary("RETR {}".format(f), out.write)
        print("downloaded {}".format(f))
ftp.quit()
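The script above writes each file into the current working directory. If you would rather collect the downloads in a separate folder, one possible variation is sketched below. The function name `download_files` and its parameters are made up for this example, and it assumes `ftp` is an already logged-in connection positioned in the remote directory.

```python
import os


def download_files(ftp, names, dest_dir):
    """Download each name from the connection's current directory into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for name in names:
        local_path = os.path.join(dest_dir, name)
        # retrbinary passes each received block of bytes to out.write.
        with open(local_path, "wb") as out:
            ftp.retrbinary("RETR " + name, out.write)
        saved.append(local_path)
    return saved
```

It returns the local paths it wrote, so the caller can print or check them afterwards.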