I need to download all the files from https://www.sec.gov/litigation/suspensions.shtml. The site has data from 1995 to 2017, and within each year there are multiple links to files that need to be downloaded. The files are in .pdf, .htm and .txt formats. I have tried scraping the data by following various tutorials, but what I need to do is different from the usual web-scraping tutorials. I used the following code, but it did not achieve my goal. I am new to Python and am stuck on how to move forward. Can anyone suggest what needs to be done?
Answer 0: (score 0)
This should do the job. Checked on Python 3.6, but the code should also be compatible with Python 2.7. The main idea is to find the link for each year, and then collect all the links to the pdf, htm and txt files from each year's page.
from __future__ import print_function

import os

import requests
from bs4 import BeautifulSoup


def file_links_filter(tag):
    """
    Tag filter. Return True for links that end with 'pdf', 'htm' or 'txt'.
    """
    if isinstance(tag, str):
        return tag.endswith('pdf') or tag.endswith('htm') or tag.endswith('txt')


def get_links(tags_list):
    # Turn the relative hrefs into absolute URLs
    return [WEB_ROOT + tag.attrs['href'] for tag in tags_list]


def download_file(file_link, folder):
    file = requests.get(file_link).content
    name = file_link.split('/')[-1]
    save_path = os.path.join(folder, name)
    print("Saving file:", save_path)
    with open(save_path, 'wb') as fp:
        fp.write(file)


WEB_ROOT = 'https://www.sec.gov'
SAVE_FOLDER = os.path.expanduser('~/download_files/')  # directory in which files will be downloaded
if not os.path.isdir(SAVE_FOLDER):
    os.makedirs(SAVE_FOLDER)

r = requests.get("https://www.sec.gov/litigation/suspensions.shtml")
soup = BeautifulSoup(r.content, 'html.parser')
years = soup.select("p#archive-links > a")  # css selector for all <a> inside <p id='archive-links'>
years_links = get_links(years)

links_to_download = []
for year_link in years_links:
    page = requests.get(year_link)
    beautiful_page = BeautifulSoup(page.content, 'html.parser')
    # href=file_links_filter makes BeautifulSoup call the filter with each href string
    links = beautiful_page.find_all("a", href=file_links_filter)
    links = get_links(links)
    links_to_download.extend(links)

# make a set to exclude duplicate links
links_to_download = set(links_to_download)
print("Got links:", links_to_download)

for link in links_to_download:
    download_file(link, SAVE_FOLDER)
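One practical note: the SEC's fair-access guidelines ask automated clients to send a descriptive User-Agent and to pace their requests, and bulk downloads without one can come back as 403 errors. Below is a minimal sketch of a more polite download helper that you could swap in for download_file; the header string, the delay value, and the helper name are placeholders to adapt, not part of the original answer.

import os
import time

import requests

# Hypothetical identification string -- replace with your own name and contact email.
HEADERS = {'User-Agent': 'Sample Company Name admin@example.com'}

session = requests.Session()
session.headers.update(HEADERS)


def polite_download(file_link, folder, delay=0.5):
    """Download one file, then pause briefly to avoid hammering the server."""
    response = session.get(file_link)
    response.raise_for_status()  # fail loudly on 403/404 instead of saving an error page
    name = file_link.split('/')[-1]
    save_path = os.path.join(folder, name)
    with open(save_path, 'wb') as fp:
        fp.write(response.content)
    time.sleep(delay)  # simple throttle between requests

Using a requests.Session here also reuses the underlying connection across the many small downloads, which is a bit faster than opening a new connection for every file.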