I'm writing a script that uses a regular expression to find PDF links on a page and then download those links. The script runs, and it names the files correctly in my personal directory, but it isn't downloading the complete PDF files. The PDFs that get pulled are only 19 KB, corrupted PDFs, when they should be around 15 MB.
import urllib, urllib2, re

url = 'http://www.website.com/Products'
destination = 'C:/Users/working/'

website = urllib2.urlopen(url)
html = website.read()
links = re.findall('.PDF">.*_geo.PDF', html)

for item in links:
    DL = item[6:]
    DL_PATH = url + '/' + DL
    SV_PATH = destination + DL
    urllib.urlretrieve(DL_PATH, SV_PATH)
The url variable points to the page that contains all of the PDF links. When you click a PDF link, it goes to "www.website.com/Products/NorthCarolina.pdf" and the PDF displays in the browser. I'm not sure whether, because of this, I should be using a different Python method or module.
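As a diagnostic aside: a 19 KB "corrupted PDF" is often not a truncated download at all, but an HTML error page the server returned that got saved under the .pdf name. A quick, hedged sketch for checking whether a downloaded file is really a PDF is to inspect its magic bytes (every valid PDF starts with `%PDF`):

```python
def looks_like_pdf(data):
    """Return True if the byte string starts with the PDF magic bytes b'%PDF'."""
    return data[:4] == b'%PDF'

# An HTML error page saved with a .pdf extension fails this check,
# while a genuine PDF passes it.
```

Running this over the first few bytes of one of the 19 KB files would show immediately whether the server is sending an error page instead of the document.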
Answer 0 (score: 5)

You could try something like this:
import requests

links = ['link.pdf']

for link in links:
    book_name = link.split('/')[-1]
    with open(book_name, 'wb') as book:
        a = requests.get(link, stream=True)
        for block in a.iter_content(512):
            if not block:
                break
            book.write(block)
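Putting the two pieces together — the link extraction from the question and the streamed download from the answer — might look like the sketch below. It assumes the placeholder URL and destination directory from the question, and it swaps the slice-based filename cleanup (`item[6:]`) for a regex capture group, which returns just the filename and avoids the off-by-N risk of slicing:

```python
import os
import re

import requests  # third-party HTTP library used in the answer above

BASE_URL = 'http://www.website.com/Products'  # placeholder from the question
DEST_DIR = 'C:/Users/working/'                # placeholder from the question


def extract_pdf_names(html):
    """Return the bare *_geo.PDF filenames referenced in the page HTML.

    The capture group grabs only the filename itself, so no slicing
    of the match is needed afterwards.
    """
    return re.findall(r'href="([^"]*_geo\.PDF)"', html, flags=re.IGNORECASE)


def download_pdf(pdf_url, save_path):
    """Stream a PDF to disk in 8 KB chunks so large files stay out of memory."""
    response = requests.get(pdf_url, stream=True)
    response.raise_for_status()  # fail loudly instead of saving an error page
    with open(save_path, 'wb') as fh:
        for chunk in response.iter_content(8192):
            fh.write(chunk)


# Usage (requires network access to the real site):
# html = requests.get(BASE_URL).text
# for name in extract_pdf_names(html):
#     download_pdf(BASE_URL + '/' + name, os.path.join(DEST_DIR, name))
```

The `raise_for_status()` call is the key difference from the original script: `urllib.urlretrieve` happily saves whatever body the server returns, including a small HTML error page, which would explain the 19 KB files.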