Downloading PDFs from links scraped with Beautiful Soup

Time: 2016-08-30 21:24:13

Tags: python pdf beautifulsoup

I am trying to write a script that iterates over a list of landing-page URLs from a csv file, appends all the PDF links found on each landing page to a list, and then loops over that list to download the PDFs into a specified folder.

I am a bit stuck on the last step - I can collect all the PDF URLs, but the downloads keep overwriting one another. I am not sure how best to modify the output path so that each URL gets its own unique filename.

Any help would be greatly appreciated!

from bs4 import BeautifulSoup, SoupStrainer
import requests
import re

#example url
url = "https://beta.companieshouse.gov.uk/company/00445790/filing-history"
link_list = []
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

# collect the absolute URLs of all document (PDF) links on the page
for a in soup.find_all('a', href=True):
    if "document" in a['href']:
        link_list.append("https://beta.companieshouse.gov.uk"+a['href'])

for url in link_list:

    response = requests.get(url)

    # problem: every PDF is written to the same path, overwriting the previous one
    with open('C:/Users/Desktop/CompaniesHouse/report.pdf', 'wb') as f:
        f.write(response.content)

1 Answer:

Answer 0 (score: 2):

The simplest way is to use enumerate to add a number to each filename:
for ind, url in enumerate(link_list, 1):
    response = requests.get(url)

    with open('C:/Users/Desktop/CompaniesHouse/report_{}.pdf'.format(ind), 'wb') as f:
        f.write(response.content)
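
With enumerate starting at 1, this saves the files as report_1.pdf, report_2.pdf, and so on, one per downloaded URL.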

But presuming each path ends in some_filename.pdf and those names are unique, you can use the basename itself, which may be more descriptive:

from os.path import basename, join
for url in link_list:  
    response = requests.get(url)   
    with open(join('C:/Users/Desktop/CompaniesHouse', basename(url)), 'wb') as f:
        f.write(response.content)
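
To tie this back to the csv of landing pages mentioned in the question, here is a minimal sketch combining both steps. It assumes a hypothetical file landing_pages.csv with one landing-page URL per row in the first column, uses the same output folder as above, and keeps the basename approach for naming; adjust the paths and column index to match your data.

import csv
import os
from os.path import basename, join

import requests
from bs4 import BeautifulSoup

out_dir = 'C:/Users/Desktop/CompaniesHouse'  # assumed output folder from the question
os.makedirs(out_dir, exist_ok=True)          # create the folder if it does not exist yet

# gather PDF links from every landing page listed in the csv
link_list = []
with open('landing_pages.csv') as csvfile:   # hypothetical csv: one landing-page URL per row
    for row in csv.reader(csvfile):
        page_url = row[0]                    # assumes the URL sits in the first column
        soup = BeautifulSoup(requests.get(page_url).content, "lxml")
        for a in soup.find_all('a', href=True):
            if "document" in a['href']:
                link_list.append("https://beta.companieshouse.gov.uk" + a['href'])

# download each PDF under its own file name
for url in link_list:
    response = requests.get(url)
    with open(join(out_dir, basename(url)), 'wb') as f:
        f.write(response.content)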