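I am trying to download the PDFs I need, which are related to a researcher.
However, the downloaded PDFs cannot be opened: the reader says the file may be damaged or in the wrong format. Another URL that I used for testing produced normal PDF files, though. Do you have any suggestions?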
import requests
from bs4 import BeautifulSoup

def download_file(url, index):
    local_filename = index+"-"+url.split('/')[-1]
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
    return local_filename

# For Test: http://ww0.java4.datastructures.net/handouts/
# Can't open: http://flyingv.ucsd.edu/smoura/publications.html
root_link="http://ecal.berkeley.edu/publications.html#journals"
r=requests.get(root_link)
if r.status_code==200:
    soup=BeautifulSoup(r.text)
    # print soup.prettify()
    index=1
    for link in soup.find_all('a'):
        new_link=root_link+link.get('href')
        if new_link.endswith(".pdf"):
            file_path=download_file(new_link,str(index))
            print "downloading:"+new_link+" -> "+file_path
            index+=1
    print "all download finished"
else:
    print "errors occur."
Answer 0 (score: 0)
Your code has a comment that says:
# Can't open: http://flyingv.ucsd.edu/smoura/publications.html
It looks like what you can't open is an HTML file. No wonder the PDF reader complains about it...
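One possible reason the saved files are HTML, assuming the links on the page are relative: the question's code builds each URL as root_link + href, and since root_link ends in "#journals", the whole href lands in the URL fragment. The fragment is not sent to the server, so the request still fetches publications.html. Below is a minimal sketch (Python 3, standard library only) of the difference; the href value is hypothetical and only there for illustration.

```python
from urllib.parse import urljoin, urlparse

root_link = "http://ecal.berkeley.edu/publications.html#journals"
href = "pubs/example.pdf"  # hypothetical relative link, for illustration only

# Plain concatenation, as in the question: the href ends up after '#',
# i.e. inside the fragment, so the requested path is still the HTML page.
concatenated = root_link + href

# urljoin resolves the href against the page URL the way a browser would
# and drops the base URL's fragment.
resolved = urljoin(root_link, href)

print(concatenated, "-> path requested:", urlparse(concatenated).path)
print(resolved, "-> path requested:", urlparse(resolved).path)
```

Note that the concatenated URL still ends with ".pdf", so the endswith check in the question would not catch the problem.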
For any real PDF link that gave me trouble, I would proceed as follows:
Download the file by some other means (wget, curl, a browser, ...).
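Whichever tool was used for the download, a quick way to tell a real PDF from a saved HTML page is to look at the first bytes of the file, since PDF files normally begin with the signature %PDF-. Here is a minimal sketch of that check; looks_like_pdf is a hypothetical helper name, not part of the question's code or of any library.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    with open(path, "rb") as f:
        # Read only the first five bytes and compare them to b"%PDF-".
        return f.read(5) == b"%PDF-"
```

If this returns False for a file the script saved, opening that file in a text editor usually shows what the server actually sent, for example an HTML page.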