I'm trying to write a script that iterates over a list of landing page URLs in a CSV file, appends every PDF link found on each target page to a list, and then loops over that list to download the PDFs into a specified folder.
I'm a bit stuck on the last step: I can collect all of the PDF URLs, but I can only download them one at a time. I'm not sure how best to modify the destination path so that each URL gets its own unique filename.
Any help would be much appreciated!
from bs4 import BeautifulSoup
import requests

# Example landing page URL (one entry from the CSV)
url = "https://beta.companieshouse.gov.uk/company/00445790/filing-history"

link_list = []
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")

# Collect every document link on the page, prefixing the site root onto the relative href
for a in soup.find_all('a', href=True):
    if "document" in a['href']:
        link_list.append("https://beta.companieshouse.gov.uk" + a['href'])

# Problem: every download overwrites the same file
for url in link_list:
    response = requests.get(url)
    with open('C:/Users/Desktop/CompaniesHouse/report.pdf', 'wb') as f:
        f.write(response.content)
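(The snippet above handles a single example URL; the question also describes reading the landing pages from a CSV file. A minimal sketch of that step, assuming a hypothetical one-column file named landing_pages.csv with one URL per row, might look like:)

import csv
import requests
from bs4 import BeautifulSoup

link_list = []
# landing_pages.csv is a hypothetical single-column file of landing page URLs
with open('landing_pages.csv', newline='') as f:
    for row in csv.reader(f):
        page_url = row[0]
        r = requests.get(page_url)
        soup = BeautifulSoup(r.content, "lxml")
        for a in soup.find_all('a', href=True):
            if "document" in a['href']:
                link_list.append("https://beta.companieshouse.gov.uk" + a['href'])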
Answer 0 (score: 2):
The simplest way is to use enumerate to add a number to each filename:

for ind, url in enumerate(link_list, 1):
    response = requests.get(url)
    with open('C:/Users/Desktop/CompaniesHouse/report_{}.pdf'.format(ind), 'wb') as f:
        f.write(response.content)
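The second argument to enumerate starts the numbering at 1, so the files come out as report_1.pdf, report_2.pdf, and so on. If sort order matters, zero-padding the index, e.g. 'report_{:03d}.pdf'.format(ind), keeps the files listing in download order.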
But assuming each path ends in some_filename.pdf and those names are unique, you can use the basename itself, which is probably more descriptive:
from os.path import basename, join

for url in link_list:
    response = requests.get(url)
    # basename(url) keeps the original filename from the end of the link
    with open(join('C:/Users/Desktop/CompaniesHouse', basename(url)), 'wb') as f:
        f.write(response.content)
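(One caveat not covered by the answer above: if a URL carries a query string, basename keeps it in the filename, and a URL whose path ends in a slash yields an empty basename. A hedged variant that strips the query via urllib.parse and falls back to the loop index, sketched under the same assumptions; it still will not deduplicate identical basenames across pages:)

from os.path import basename, join
from urllib.parse import urlparse

for ind, url in enumerate(link_list, 1):
    response = requests.get(url)
    # urlparse(url).path drops any query string before taking the basename;
    # fall back to a numbered name when the basename comes back empty
    name = basename(urlparse(url).path) or 'report_{}.pdf'.format(ind)
    with open(join('C:/Users/Desktop/CompaniesHouse', name), 'wb') as f:
        f.write(response.content)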