Problem downloading multiple PDFs

Time: 2019-10-24 13:04:09

Tags: python pdf web-scraping pull-request

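After running the code below, I cannot open the downloaded PDFs. The code runs without errors, but the downloaded PDF files are corrupted.

The error message on my computer is:

The file could not be opened. It may be damaged or use a file format that Preview doesn't recognize.

Why are they corrupted, and how can I fix it?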

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"

#If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/ Physics_Solutions'
if not os.path.exists(folder_location):os.mkdir(folder_location)

response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")     
for link in soup.select("a[href$='.pdf']"):

    filename = os.path.join(folder_location,link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,link['href'])).content) 

3 answers:

Answer 0 (score: 0)

The problem is that the file is not closed properly after being opened and written.
Just add f.close() at the end of the code.
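
Note that the question's code opens each file inside a `with` block. A minimal sketch for checking whether such a file is still open after the block exits; the scratch path and the sample bytes are hypothetical and used only for this illustration:

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.pdf")
payload = b"%PDF-1.4 example bytes"

with open(path, "wb") as f:
    f.write(payload)

# The with block has exited, so the file object is already closed
# and everything written to it has been flushed to disk.
print(f.closed)
print(os.path.getsize(path) == len(payload))
```

Both lines print True, so an extra f.close() after a `with` block should not change what ends up in the file.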

Answer 1 (score: 0)

The problem is that you are requesting GitHub's 'blob' link when you need the 'raw' link. You are requesting:

'/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'

But you want:

'/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'

So just adjust the link accordingly. The full code is below:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"

# Create the target folder if it does not already exist
folder_location = r'/Users/rahelmizrahi/Desktop/Physics_Solutions'
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # The page links to the HTML '/blob/' view; '/raw/' serves the file itself
    pdf_link = link['href'].replace('/blob/', '/raw/', 1)
    pdf_file = requests.get(urljoin(url, pdf_link))
    pdf_file.raise_for_status()  # fail on HTTP errors instead of saving an error page
    filename = os.path.join(folder_location, pdf_link.split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(pdf_file.content)
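
The blob-to-raw rewrite can also be isolated in a small helper, which makes it easy to test without any network access. This is only a sketch: the function name `blob_to_raw` and the example path are hypothetical and not part of the original answer.

```python
def blob_to_raw(href: str) -> str:
    """Turn a GitHub '/blob/' path or URL into its '/raw/' counterpart."""
    # Replace only the first '/blob/' segment, so a folder that happens
    # to be named 'blob' later in the path is left untouched.
    return href.replace("/blob/", "/raw/", 1)


# Hypothetical path, for illustration only.
example = "/someuser/somerepo/blob/master/docs/blob/notes.pdf"
print(blob_to_raw(example))
# prints /someuser/somerepo/raw/master/docs/blob/notes.pdf
```

Matching on '/blob/' with the slashes, rather than on the bare word, avoids rewriting a file name that merely contains "blob".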

Answer 2 (score: 0)

I had to use soup.select("a[href$=.pdf]") (without the inner quotes) to get it to select the links correctly.

After that, your script works, but what you are downloading is not a PDF, it is an HTML web page! Try visiting one of the URLs: https://github.com/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf

You will see a GitHub web page rather than the actual PDF. To get the PDF, you need the "raw" GitHub URL, which you can see when you hover over the "Download" button: https://github.com/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf

So it looks like you just need to replace blob with raw in the right place to make it work:

href = link['href']
href = href.replace('/blob/', '/raw/')
pdf_bytes = requests.get(urljoin(url, href)).content
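
One way to tell whether a download is a real PDF or an HTML page saved under a .pdf name is to look at the first bytes, since PDF files begin with the marker %PDF-. A minimal sketch; the helper name `looks_like_pdf` and the sample byte strings are hypothetical:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes begin with the PDF header marker."""
    return data.startswith(b"%PDF-")


# Hypothetical samples standing in for two downloaded response bodies.
html_page = b"<!DOCTYPE html><html><body>GitHub page</body></html>"
pdf_bytes = b"%PDF-1.4\n% rest of the file"

print(looks_like_pdf(html_page))  # False
print(looks_like_pdf(pdf_bytes))  # True
```

In the download loop, this check could be applied to the response content before writing it to disk, so that HTML pages are reported instead of being saved as broken PDFs.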