I need to automatically download zipped files from a website that doesn't have unique URLs. The data is located in the Related Downloads links at the bottom right. I have no experience with Python or scripting of any kind, so I need a tool a novice can use. I would also like to know whether the automation can include unzipping the files.
I would appreciate any assistance/advice.
Answer 0 (score: 1)
You should look at BeautifulSoup and requests as a starting point. I would write a script that runs once a day and checks the zip file links for new files.
import requests
from bs4 import BeautifulSoup

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# grab every anchor tag, pull out its href, and keep only the zip links
all_hrefs = soup.find_all('a')
all_links = [link.get('href') for link in all_hrefs]
zip_files = [dl for dl in all_links if dl and '.zip' in dl]
This will give you a list of all the zip files on that main landing page (assuming the extension is always lowercase). I would just save that information to a SQLite database, or even just a plain text file with one zip file per line. Then when you run the script, it grabs the links with the code above, opens the database (or text file), and compares to see whether anything new is there.
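As a minimal sketch of the plain-text variant (the seen_links.txt file name is my own placeholder, and it assumes zip_files was built by the scraping code above), the comparison could look like this:

import os

# sketch: track links we have already seen in a plain text file,
# one link per line ('seen_links.txt' is a placeholder name)
seen_path = 'seen_links.txt'

seen = set()
if os.path.exists(seen_path):
    with open(seen_path) as f:
        seen = {line.strip() for line in f}

# anything not in the file is new since the last run
new_links = [link for link in zip_files if link not in seen]

# record everything seen so far for the next run
with open(seen_path, 'w') as f:
    for link in seen | set(zip_files):
        f.write(link + '\n')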
If you find new links, you can download them with the excellent requests library. You'll need something like this:
import os

import requests

root = 'http://phmsa.dot.gov/'
download_folder = '/path/to/download/zip/files/'

for zip_file in zip_files:
    # build the absolute URL and a local path, then save the archive
    full_url = root + zip_file
    r = requests.get(full_url)
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    with open(dl_path, 'wb') as z_file:
        z_file.write(r.content)
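For larger archives, requests can also stream the download in chunks instead of holding the whole file in memory (stream=True and iter_content are part of the requests API; the chunk size here is an arbitrary choice). A variation on the loop body above:

# sketch: stream the response to disk in chunks rather than
# buffering the entire archive in r.content
r = requests.get(full_url, stream=True)
with open(dl_path, 'wb') as z_file:
    for chunk in r.iter_content(chunk_size=8192):
        z_file.write(chunk)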
Here is a full example that downloads all the zip files on the page each time it is run:
import os

import requests
from bs4 import BeautifulSoup

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data'
root = 'http://phmsa.dot.gov/'

# find all the zip links on the landing page
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_hrefs = soup.find_all('a')
all_links = [link.get('href') for link in all_hrefs]
zip_files = [dl for dl in all_links if dl and '.zip' in dl]

# make sure the download folder exists
download_folder = '/home/mdriscoll/Downloads/zip_files'
if not os.path.exists(download_folder):
    os.makedirs(download_folder)

# download each archive into the folder
for zip_file in zip_files:
    full_url = root + zip_file
    r = requests.get(full_url)
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    with open(dl_path, 'wb') as z_file:
        z_file.write(r.content)
Update #2 - added unzip functionality:
import os
import zipfile

import requests
from bs4 import BeautifulSoup

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data'
root = 'http://phmsa.dot.gov/'

# find all the zip links on the landing page
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_hrefs = soup.find_all('a')
all_links = [link.get('href') for link in all_hrefs]
zip_files = [dl for dl in all_links if dl and '.zip' in dl]

# make sure the download folder exists
download_folder = '/home/mdriscoll/Downloads/zip_files'
if not os.path.exists(download_folder):
    os.makedirs(download_folder)

for zip_file in zip_files:
    full_url = root + zip_file
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    if os.path.exists(dl_path):
        # you have already downloaded this file, so skip it
        continue

    # retry a corrupt download up to 3 times; reset the counter per file
    tries = 0
    while tries < 3:
        r = requests.get(full_url)
        with open(dl_path, 'wb') as z_file:
            z_file.write(r.content)

        # unzip the archive into a folder named after it
        extract_dir = os.path.splitext(os.path.basename(zip_file))[0]
        try:
            z = zipfile.ZipFile(dl_path)
            z.extractall(os.path.join(download_folder, extract_dir))
            break
        except zipfile.BadZipfile:
            # the file didn't download correctly, so try again
            # this is also a good place to log the error
            tries += 1
I noticed in my testing that occasionally a file would fail to download correctly and I would receive a BadZipFile error, so I added some code that tries up to 3 times before moving on to the next file.
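As a lighter-weight variation on that retry logic, the archive can also be sanity-checked before extraction with the standard library's zipfile.is_zipfile. A sketch, assuming full_url and dl_path are defined as in the loop above:

import zipfile

import requests

# sketch: re-download until the file passes a basic zip integrity check,
# giving up after 3 attempts (assumes full_url and dl_path from above)
for attempt in range(3):
    r = requests.get(full_url)
    with open(dl_path, 'wb') as z_file:
        z_file.write(r.content)
    if zipfile.is_zipfile(dl_path):
        break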