Question

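I am trying to download the PDFs I need for a particular researcher.

But the downloaded PDFs cannot be opened: the viewer says the file may be damaged or in the wrong format. Another URL that I used for testing produced normal PDF files, though. Do you have any suggestions?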

 import requests  
 from bs4 import BeautifulSoup  


 def download_file(url, index):  
     local_filename = index+"-"+url.split('/')[-1]  
     # NOTE the stream=True parameter  
     r = requests.get(url, stream=True)  
     with open(local_filename, 'wb') as f:  
         for chunk in r.iter_content(chunk_size=1024):  
             if chunk: # filter out keep-alive new chunks  
                 f.write(chunk)  
                 f.flush()  
     return local_filename  


 # For Test:   http://ww0.java4.datastructures.net/handouts/
 # Can't open: http://flyingv.ucsd.edu/smoura/publications.html

 root_link="http://ecal.berkeley.edu/publications.html#journals"

 r=requests.get(root_link)  
 if r.status_code==200:  
     soup=BeautifulSoup(r.text)  
     # print soup.prettify()  
     index=1  
     for link in soup.find_all('a'):  
         new_link=root_link+link.get('href')
         if new_link.endswith(".pdf"):  
             file_path=download_file(new_link,str(index))  
             print "downloading:"+new_link+" -> "+file_path  
             index+=1  
     print "all download finished"  
 else:  
     print "errors occur."

Answer 1

Your code has a comment that says:

# Can't open: http://flyingv.ucsd.edu/smoura/publications.html

It looks like what you can't open is an HTML file. No wonder the PDF reader complains about it...
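One quick way to test that suspicion is to look at the first bytes of a saved file: a PDF normally starts with the `%PDF-` marker, while an HTML page usually contains `<!DOCTYPE html` or `<html` near the top. Below is a minimal sketch in Python 3 (the question's script is Python 2). The helper name `sniff_file_type` and the 1024-byte window are assumptions made for illustration and are not part of the original script.

```python
import sys


def sniff_file_type(path, window=1024):
    """Guess whether a file is a PDF, an HTML page, or something else.

    Heuristic only: it inspects the first `window` bytes and nothing more.
    """
    with open(path, "rb") as f:
        head = f.read(window)

    # PDF files carry a "%PDF-" header at (or very near) the start.
    if b"%PDF-" in head:
        return "pdf"

    # HTML pages typically contain a doctype or an <html> tag near the top.
    lowered = head.lower()
    if b"<!doctype html" in lowered or b"<html" in lowered:
        return "html"

    return "unknown"


if __name__ == "__main__":
    # Usage: python sniff.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        print(name, "->", sniff_file_type(name))
```

If a file saved with a `.pdf` name is reported as `html`, the server most likely returned a web page rather than the document, which would point at the URL being requested and not at the download loop.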

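For any real PDF link that gives me trouble, I would proceed as follows:

Download the file by another method (wget, curl, a browser, ...).
- Can you download it at all? Or is there some password hoop to jump through?
- Is the download fast and complete?
Does it then open in a PDF viewer?
- If yes, compare it with the file your script downloaded.
  - What are the differences?
  - Could they be caused by your script?
  - Do the files match for the first few hundred lines but differ later? Is the end of the file a bunch of NUL bytes? Then your download did not complete...
- If no, still compare the two files. If there are no differences, your script is not at fault; the PDF may really be corrupted...
What does it look like when you open it in a text editor?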
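The comparison step above can be sketched in a few lines of Python 3. This is only an illustration under stated assumptions: the function names `first_difference` and `trailing_nul_count` are hypothetical helpers, and the two paths stand for a reference copy (fetched with wget, curl, or a browser) and the copy written by the script.

```python
import sys


def first_difference(path_a, path_b, chunk_size=64 * 1024):
    """Return the byte offset of the first difference, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the exact position inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # One chunk is a prefix of the other: the files differ in length.
                return offset + min(len(a), len(b))
            if not a:  # both files ended at the same point
                return None
            offset += len(a)


def trailing_nul_count(path):
    """Count NUL bytes at the end of a file (reads the whole file into memory)."""
    with open(path, "rb") as f:
        data = f.read()
    return len(data) - len(data.rstrip(b"\x00"))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare.py reference.pdf script_download.pdf
    reference, candidate = sys.argv[1], sys.argv[2]
    print("first difference at byte:", first_difference(reference, candidate))
    print("trailing NUL bytes in candidate:", trailing_nul_count(candidate))
```

A result of `None` means the two copies are byte-for-byte identical, so the script is not the culprit. A difference at offset 0 suggests the script saved something else entirely, while a late difference or a run of trailing NUL bytes hints at an incomplete download.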
