I have a list of files that I'd like to download quickly with Python. How can I do that? Here is the list:
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.pr001
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch001
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch002
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch003
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch004
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch005
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch006
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch007
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch008
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch009
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch010
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch011
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch012
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch013
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch014
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch015
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch016
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch017
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch018
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch019
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch020
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch021
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch022
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch023
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch024
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch025
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch026
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch027
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch028
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch029
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch030
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch031
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch032
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot001
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot002
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot003
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot004
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot005
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ix002
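This is a bit different from what I'm used to. Previously I just had .pdf files linked on a page. In any case, my university pays for access, and I want to download the whole thing quickly rather than by hand...
I tried the following, but when I try to open the PDFs in my local directory I get an error message: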
import urllib2
pdf_urls = [
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.pr001',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch001',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch002',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch003',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch004',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch005',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch006',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch007',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch008',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch009',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch010',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch011',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch012',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch013',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch014',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch015',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch016',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch017',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch018',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch019',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch020',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch021',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch022',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch023',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch024',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch025',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch026',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch027',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch028',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch029',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch030',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch031',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch032',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot001',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot002',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot003',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot004',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot005',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ix002'
]
for pdf_url in pdf_urls:
    response = urllib2.urlopen('http://' + pdf_url)
    output_name = pdf_url.rpartition('.')[2] + '.pdf'
    output_file = open(output_name, 'wb')
    output_file.write(output_name)
To be honest, I don't think I know what I'm doing...
Answer 0 (score: 2)
You can use the requests library for Python:
import requests
session = requests.Session()
pdf_urls = [
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001',
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.pr001'
#and other files....
]
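for pdf_url in pdf_urls:
    r = session.get('http://' + pdf_url)
    # Name the file after the last URL component, e.g. fw001.pdf
    output_name = pdf_url.rpartition('.')[2] + '.pdf'
    # 'wb' because PDF data is binary; the with block closes the file
    with open(output_name, 'wb') as output_file:
        output_file.write(r.content)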
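This code saves the files in the current working directory, which is the script's own directory when the script is run from there.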
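The list in the question follows a regular pattern, so it could also be generated instead of typed out. A minimal sketch, written for Python 3: `build_urls` and `BASE` are hypothetical names, and the section codes and counts are simply read off the question's list.

```python
BASE = 'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093'

def build_urls(base=BASE):
    """Return the 40 URLs from the question's list (without the scheme)."""
    suffixes = ['fw001', 'pr001']
    suffixes += ['ch%03d' % n for n in range(1, 33)]  # ch001 .. ch032
    suffixes += ['ot%03d' % n for n in range(1, 6)]   # ot001 .. ot005
    suffixes += ['ix002']
    return ['%s.%s' % (base, s) for s in suffixes]

pdf_urls = build_urls()
```

The resulting `pdf_urls` can be passed to the download loop unchanged.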
Edit: the same code using urllib2 (Python 2):
import urllib2
from cookielib import CookieJar
pdf_urls = [
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001'
]
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
for pdf_url in pdf_urls:
    response = opener.open('http://' + pdf_url)
    content = response.read()
    output_name = pdf_url.rpartition('.')[2] + '.pdf'
    # 'wb' because PDF data is binary; the with block closes the file
    with open(output_name, 'wb') as output_file:
        output_file.write(content)
Some explanation:
First, http://pubs.acs.org/ requires the browser (or our Python script) to accept cookies. We can use a CookieJar for that:
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
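For readers on Python 3, where `urllib2` and `cookielib` were renamed to `urllib.request` and `http.cookiejar`, the same setup might look like the sketch below. `make_cookie_opener` is a hypothetical helper name, not part of the original answer.

```python
import http.cookiejar
import urllib.request

def make_cookie_opener():
    """Build an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

opener, jar = make_cookie_opener()
# opener.open('http://' + pdf_url) would now keep any cookies in `jar`
```

Returning the jar alongside the opener makes it possible to inspect which cookies the site set, which can help when a download comes back as a login page.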
Inside the loop, we iterate over the list of URLs and download each file:
response = opener.open('http://' + pdf_url)
content = response.read()
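content now holds a single PDF file, which we save. First we generate the output file name. rpartition returns a three-element tuple (the text before the last separator, the separator itself, and the text after it), so, for example,
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001'.rpartition('.')[2]
gives us fw001. We append the .pdf extension to get the file name. Then we open the file in binary mode for writing: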
output_file = open(output_name, 'wb')
and write the PDF data fetched from the site:
output_file.write(content)
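The naming and writing steps can be wrapped in a small helper. The sketch below is written for Python 3, and `output_name_for` and `save_pdf` are hypothetical names. It adds one check that the original answer does not have: PDF files begin with the bytes `%PDF`, so anything else is rejected. This assumes a failed or unauthorized request returns something like an HTML page instead of the PDF, which would explain a file that cannot be opened.

```python
import os

def output_name_for(pdf_url):
    """Turn '.../bk-2012-1093.fw001' into 'fw001.pdf'."""
    head, sep, tail = pdf_url.rpartition('.')  # split at the last dot
    return tail + '.pdf'

def save_pdf(pdf_url, content, directory='.'):
    """Write content (bytes) into directory; refuse data that is not a PDF."""
    if not content.startswith(b'%PDF'):
        raise ValueError('not a PDF (login or error page?): ' + pdf_url)
    path = os.path.join(directory, output_name_for(pdf_url))
    with open(path, 'wb') as output_file:  # binary mode, closed automatically
        output_file.write(content)
    return path
```

In the download loop this would be called as `save_pdf(pdf_url, content)`, with `content` being the bytes read from the response.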