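I am trying to download all the PDF slides hosted on Google Drive. The URLs I collect point to Google Drive pages that redirect to the PDFs. When I try to download a PDF with requests, I only get an HTML page (122 KB) instead of the binary data.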
import os, sys, time, random
import requests
from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://mila.umontreal.ca/en/cours/deep-learning-summer-school-2017/slides'

def download(url, name):
    # Fetch the URL and write the raw response body to disk.
    response = requests.get(url)
    pdf = response.content
    with open(name, 'wb') as f:
        f.write(pdf)

# Open the slides page and switch into the iframe that lists the files.
browser = webdriver.Chrome()
browser.get(url)
browser.switch_to_frame(browser.find_element_by_class_name('iframe-class'))

# Collect the link and title of every entry in the listing.
links = browser.find_elements_by_css_selector('.flip-entry a')
titles = browser.find_elements_by_css_selector('.flip-entry-title')
pdfs = [link.get_attribute('href') for link in links]
names = [title.text for title in titles]
browser.quit()

for i, pdf in enumerate(pdfs):
    download(pdf, names[i])
Answer 0 (score: 1)
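The problem is that the links you extract are view links, not download links. When you request a view link, you get a Google Drive HTML page, which then loads the file in the browser using JavaScript and shows a download button for you to download the file.
So you need to add code that changes each view link into a download link: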
for i, pdf in enumerate(pdfs):
    # get the doc id
    doc_id = pdf.split("/")[-2]
    download_url = "https://drive.google.com/uc?id={}&export=download".format(doc_id)
    download(download_url, names[i])
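The split("/")[-2] approach assumes every link has the same shape. As a rough sketch, the conversion could be wrapped in a small helper that also lets you check whether the bytes you received are really a PDF. The names to_download_url and looks_like_pdf are hypothetical, and the two link shapes matched below (/file/d/<id>/... and ...?id=<id>) are assumptions about what the page returns, not something confirmed by the question.

```python
import re

# Assumed shapes of Google Drive view links (not confirmed by the question):
#   https://drive.google.com/file/d/<id>/view?usp=sharing
#   https://drive.google.com/open?id=<id>
_ID_PATTERNS = (
    re.compile(r"/file/d/([\w-]+)"),
    re.compile(r"[?&]id=([\w-]+)"),
)


def to_download_url(view_url):
    """Build a uc?id=...&export=download URL from a view link (hypothetical helper)."""
    for pattern in _ID_PATTERNS:
        match = pattern.search(view_url)
        if match:
            return "https://drive.google.com/uc?id={}&export=download".format(
                match.group(1)
            )
    raise ValueError("no document id found in: {}".format(view_url))


def looks_like_pdf(data):
    """Return True if the bytes start with the PDF header marker."""
    return data[:5] == b"%PDF-"
```

Inside download(), calling looks_like_pdf(response.content) before writing the file would flag the case from the question, where an HTML page comes back instead of the PDF.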