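I am trying to download journal issues from a website (http://cis-ca.org/islamscience1.php). I ran some code to fetch all the PDFs on this page. However, these PDFs themselves contain links to other PDFs.
I want to get the final articles that all of those PDF links point to.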
Code to get all the PDFs from the page http://cis-ca.org/islamscience1.php:

    import os
    import requests
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    url = "http://cis-ca.org/islamscience1.php"

    # If there is no such folder, the script will create one automatically
    folder_location = r'webscraping'
    if not os.path.exists(folder_location):
        os.mkdir(folder_location)

    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("a[href$='.pdf']"):
        # Name the PDF files using the last portion of each link, which is unique in this case
        filename = os.path.join(folder_location, link['href'].split('/')[-1])
        with open(filename, 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)

I want to get the articles that are linked inside these PDFs. Thanks in advance.
Answer 0 (score: 0)
https://mamclain.com/?page=Blog_Programing_Python_Removing_PDF_Hyperlinks_With_Python
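Take a look at this link. It shows how to identify the hyperlinks in a PDF document and sanitize it. You can follow it up to the identification part, and then store the hyperlinks instead of sanitizing them.
Alternatively, take a look at this library: https://github.com/metachris/pdfx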
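
The first suggestion (identify the hyperlinks, then store them instead of removing them) could be sketched roughly as below. This is only a minimal illustration, not the method from the linked post: it assumes the link annotations are stored uncompressed in the file, so that entries such as `/URI (http://...)` are visible in the raw bytes. The function names `extract_uris` and `pdf_links_only` are made up for this example.

```python
import re
from urllib.parse import urljoin

# Matches the target of a URI action, e.g. /URI (http://example.org/a.pdf).
# Assumption: the annotation is not inside a compressed object stream.
URI_PATTERN = re.compile(rb"/URI\s*\(((?:\\.|[^\\)])*)\)")


def extract_uris(pdf_bytes):
    """Return the unique URIs found in the raw PDF bytes, in order of appearance."""
    seen = []
    for match in URI_PATTERN.finditer(pdf_bytes):
        # Undo the backslash escapes PDF strings use for parentheses and backslashes
        raw = re.sub(rb"\\([()\\])", rb"\1", match.group(1))
        uri = raw.decode("latin-1")
        if uri not in seen:
            seen.append(uri)
    return seen


def pdf_links_only(uris, base_url):
    """Keep only the links that end in .pdf, resolved against base_url."""
    return [urljoin(base_url, u) for u in uris if u.lower().endswith(".pdf")]
```

To use it with the script from the question, you would read each downloaded file with `open(path, 'rb').read()`, pass the bytes to `extract_uris`, and download the results of `pdf_links_only` with `requests.get` just as before. If a PDF keeps its annotations in compressed streams, this raw scan will find nothing, and a real PDF parser such as the pdfx library mentioned above is the safer choice.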