I am trying to download the publications on every page of https://occ.ca/our-publications.
My end goal is to parse the text in the PDF files and find certain keywords.
So far, I have been able to scrape the links to the PDF files on all of the pages, and I have saved those links to a list. Now I want to go through the list and download all of the PDF files with Python. Once the files are downloaded, I want to parse through them.
This is the code I have used so far:
import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called "publications".
publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
        publications.append(links)

import urllib.request
for x in publications:
    urllib.request.urlretrieve(x, 'Publication_{}'.format(range(213)))
This is the error I get when I run the code:
Traceback (most recent call last):
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\m.py", line 23, in <module>
    urllib.request.urlretrieve(x, 'Publication_{}.pdf'.format(range(213)))
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\plumm\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Answer 0 (score: 1)
Please try:
import requests
from bs4 import BeautifulSoup
import lxml
import csv

# This code adds all PDF links into a list called "publications".
publications = []
for i in range(19):
    response = requests.get('https://occ.ca/our-publications/page/{}/'.format(i),
                            headers={'User-Agent': 'Mozilla'})
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')
        pdfs = soup.findAll('div', {"class": "publicationoverlay"})
        links = [pdf.find('a').attrs['href'] for pdf in pdfs]
        publications.extend(links)
for cntr, link in enumerate(publications):
    print("try to get link", link)
    # Reuse the same User-Agent header as above so the site is less likely to answer 403.
    rslt = requests.get(link, headers={'User-Agent': 'Mozilla'})
    print("Got", rslt)
    fname = "temporarypdf_%d.pdf" % cntr
    with open(fname, "wb") as fout:
        # rslt.content holds the downloaded bytes; rslt.raw is already consumed
        # by requests.get() unless stream=True is passed.
        fout.write(rslt.content)
    print("saved pdf data into ", fname)
    # Call here the code that reads and parses the pdf.
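For the last step the question asks about (scanning the downloaded PDFs for keywords), here is a minimal sketch, assuming PyPDF2 is installed and using its older PdfFileReader/extractText API; pdfminer.six or the newer PdfReader/extract_text API would work just as well. The KEYWORDS list is a placeholder you would replace with your own terms, the temporarypdf_%d.pdf names come from the loop above, and the function could equally be called where the "# Call here the code..." comment sits:

import PyPDF2

# Hypothetical keywords; replace with the terms you actually need to find.
KEYWORDS = ["consumer", "protection"]

def find_keywords_in_pdf(fname, keywords):
    """Return the subset of keywords that appear in the PDF's extracted text."""
    found = set()
    with open(fname, "rb") as f:
        reader = PyPDF2.PdfFileReader(f)
        for page_num in range(reader.numPages):
            # Text extraction quality varies by PDF; some pages may come back empty.
            text = reader.getPage(page_num).extractText()
            for kw in keywords:
                if kw.lower() in text.lower():
                    found.add(kw)
    return found

# Assumes the download loop above has already saved the files.
for cntr in range(len(publications)):
    fname = "temporarypdf_%d.pdf" % cntr
    hits = find_keywords_in_pdf(fname, KEYWORDS)
    if hits:
        print(fname, "contains:", ", ".join(sorted(hits)))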
Answer 1 (score: 0)
Could you please tell us the line number on which the error occurs?