How can I change my code so that it can also download PDFs linked from other PDFs?

Time: 2019-06-03 16:36:07

Tags: python pdf beautifulsoup

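I need to write code that takes a URL to either a web page or a PDF and then downloads all the PDFs linked from it. So far it works when I give it a web page, but not when I give it a PDF. I know very little Python, and I realize this is because BeautifulSoup only works on HTML and XML files, so I'm wondering whether there is something that does the same thing for PDFs.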

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = input("Please enter the URL ")
# Raw string so the backslash in the example path is not read as an escape sequence
folder_location = input(r"Please enter the folder location (e.g. C:\ExampleFolder) ")

# If the folder does not exist, create it (including any missing parent folders)
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name each PDF after the last segment of its link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
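
For the PDF case, one possible direction is sketched below, using only the standard library. It rests on an assumption that will not hold for every file: that the PDF stores its link annotations uncompressed, so each link appears in the raw bytes as `/URI (...)`. Many PDFs keep these objects in compressed streams, in which case a PDF library such as pypdf would be needed to read the annotations instead. The names `looks_like_pdf`, `pdf_links_in_pdf`, and `URI_PATTERN` are hypothetical helpers, not part of any library.

```python
import re

# Matches link actions written as: /URI (http://example.com/a.pdf)
# Assumption: the annotation objects are stored uncompressed in the file.
URI_PATTERN = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def looks_like_pdf(data):
    """Return True if the bytes begin with the PDF header marker."""
    return data.startswith(b"%PDF-")


def pdf_links_in_pdf(data):
    """Collect unique URIs ending in .pdf from uncompressed link annotations."""
    links = []
    for raw in URI_PATTERN.findall(data):
        # Undo the backslash escapes PDF strings use for parentheses and backslashes.
        uri = re.sub(rb"\\([()\\])", rb"\1", raw).decode("latin-1")
        if uri.lower().endswith(".pdf") and uri not in links:
            links.append(uri)
    return links
```

In the script above, this could be used by checking `looks_like_pdf(response.content)` after the request: if it returns True, loop over `pdf_links_in_pdf(response.content)` instead of the BeautifulSoup selection, and download each link with `urljoin(url, link)` exactly as in the HTML case.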

0 answers:

No answers yet