How to automatically download zip files from a website

Time: 2016-03-22 17:15:33

Tags: python windows python-2.7 selenium selenium-webdriver

I need to automatically download zip files from a website that doesn't have unique URL addresses. The data is in the Related Downloads links at the bottom right of the page. I have no experience with Python or any kind of scripting, so I need a tool that a novice can use. I'd also like to know whether the automation can include unzipping the files.

Any assistance/advice would be greatly appreciated.

http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data

1 answer:

Answer 0 (score: 1)

You should look at BeautifulSoup and requests as a starting place. I would write a script that runs once a day, grabs the zip file links, and checks for new files.

import requests

from bs4 import BeautifulSoup

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# grab every anchor tag, pull out the href, and keep only the zip links
all_hrefs = soup.find_all('a')
all_links = [link.get('href') for link in all_hrefs]
zip_files = [dl for dl in all_links if dl and '.zip' in dl]

This will give you a list of all the zip files on that main landing page (assuming the extension is always lowercase). I would just save that information to a SQLite database, or even a plain text file with one zip file per line. Then when you run the script, it grabs the links using the code above, opens the database (or text file), and compares them to see if anything new has appeared.
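As a minimal sketch of that bookkeeping idea (building on the zip_files list from the snippet above; seen_links.txt is just an illustrative filename, not something the site or library requires):

import os

seen_file = 'seen_links.txt'  # hypothetical bookkeeping file

# load the links recorded on previous runs, one per line
if os.path.exists(seen_file):
    with open(seen_file) as f:
        seen = set(line.strip() for line in f)
else:
    seen = set()

# anything not seen before is a new download candidate
new_links = [link for link in zip_files if link not in seen]

# record the new links so the next run can skip them
with open(seen_file, 'a') as f:
    for link in new_links:
        f.write(link + '\n')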

If it finds any new links, you can download them with the wonderful requests library. You'd need something like this:

import os
import requests

root = 'http://phmsa.dot.gov/'
download_folder = '/path/to/download/zip/files/'

for zip_file in zip_files:
    full_url = root + zip_file
    r = requests.get(full_url)
    # name the local file after the zip archive on the site
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    with open(dl_path, 'wb') as z_file:
        z_file.write(r.content)
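One caveat: root + zip_file assumes every href on the page is a path relative to the site root; an absolute href would produce a broken URL. If you want to be safe, urljoin handles both cases. A one-line sketch (urlparse is the Python 2 module name; on Python 3 it lives in urllib.parse):

import urlparse

# copes with both absolute and site-relative hrefs
full_url = urlparse.urljoin(root, zip_file)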

Update - here's a complete example that downloads all the zip files on the page each time it's run:

import os
import requests

from bs4 import BeautifulSoup

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data'
root = 'http://phmsa.dot.gov/'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

all_hrefs = soup.find_all('a')
all_links = [link.get('href') for link in all_hrefs]
zip_files = [dl for dl in all_links if dl and '.zip' in dl]
download_folder = '/home/mdriscoll/Downloads/zip_files'

# create the download folder if it doesn't already exist
if not os.path.exists(download_folder):
    os.makedirs(download_folder)

for zip_file in zip_files:
    full_url = root + zip_file
    r = requests.get(full_url)
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    with open(dl_path, 'wb') as z_file:
        z_file.write(r.content)
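As an aside, r.content holds the entire zip in memory before writing it out. For larger files you could let requests stream the body to disk instead; here's a sketch of the same download loop with stream=True (behaviour is otherwise unchanged):

for zip_file in zip_files:
    full_url = root + zip_file
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    # stream=True fetches the body lazily instead of all at once
    r = requests.get(full_url, stream=True)
    with open(dl_path, 'wb') as z_file:
        # write the file to disk in 8 KB chunks
        for chunk in r.iter_content(chunk_size=8192):
            z_file.write(chunk)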

Update #2 - adding unzip functionality:

import os
import requests
import zipfile

from bs4 import BeautifulSoup

url = 'http://phmsa.dot.gov/pipeline/library/data-stats/distribution-transmission-and-gathering-lng-and-liquid-annual-data'
root = 'http://phmsa.dot.gov/'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

all_hrefs = soup.find_all('a')
all_links = [link.get('href') for link in all_hrefs]
zip_files = [dl for dl in all_links if dl and '.zip' in dl]
download_folder = '/home/mdriscoll/Downloads/zip_files'

if not os.path.exists(download_folder):
    os.makedirs(download_folder)

for zip_file in zip_files:
    full_url = root + zip_file
    zip_filename = os.path.basename(zip_file)
    dl_path = os.path.join(download_folder, zip_filename)
    if os.path.exists(dl_path):
        # you have already downloaded this file, so skip it
        continue

    tries = 0  # reset the retry counter for each file
    while tries < 3:
        r = requests.get(full_url)
        with open(dl_path, 'wb') as z_file:
            z_file.write(r.content)

        # unzip the file
        extract_dir = os.path.splitext(os.path.basename(zip_file))[0]
        try:
            z = zipfile.ZipFile(dl_path)
            z.extractall(os.path.join(download_folder, extract_dir))
            break
        except zipfile.BadZipfile:
            # the file didn't download correctly, so try again
            # this is also a good place to log the error
            pass
        tries += 1

I noticed in my testing that occasionally a file wouldn't download correctly and I would get a BadZipfile error, so I added some code that tries 3 times before moving on to the next download.
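One wrinkle the code above doesn't handle: if all three tries fail, the corrupt zip is left on disk, and the os.path.exists() check will silently skip it on the next run. A minimal sketch of one way around that (extract_or_remove is a hypothetical helper, not part of the original answer):

import os
import zipfile

def extract_or_remove(dl_path, extract_dir):
    # hypothetical helper: extract the zip at dl_path into extract_dir,
    # deleting the file if it turns out to be corrupt
    try:
        z = zipfile.ZipFile(dl_path)
        z.extractall(extract_dir)
        return True
    except zipfile.BadZipfile:
        # remove the corrupt download so os.path.exists() won't
        # skip it on the next run
        os.remove(dl_path)
        return False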