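How can I download PDFs from pubs.acs.org using Python?

Time: 2014-05-23 08:21:10

Tags: python

I have a list of PDFs that I want to download quickly with Python. How can I do that? Here is the list: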

pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.pr001
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch001
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch002
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch003
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch004
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch005
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch006
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch007
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch008
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch009
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch010
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch011
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch012
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch013
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch014
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch015
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch016
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch017
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch018
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch019
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch020
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch021
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch022
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch023
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch024
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch025
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch026
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch027
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch028
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch029
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch030
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch031
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch032
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot001
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot002
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot003
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot004
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot005
pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ix002

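This is a bit different from what I'm used to. Until now I have only dealt with pages that link directly to .pdf files. In any case, my university pays for access, and I want to download the whole thing quickly rather than by hand...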


I tried the following, but when I try to open the PDFs in my local directory, I get an error message:

import urllib2

pdf_urls = [
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.pr001', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch001', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch002', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch003', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch004', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch005', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch006', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch007', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch008', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch009', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch010', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch011', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch012', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch013', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch014', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch015', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch016', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch017', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch018', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch019', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch020', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch021', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch022', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch023', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch024', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch025', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch026', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch027', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch028', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch029', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch030', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch031', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ch032', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot001', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot002', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot003', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot004', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ot005', 
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.ix002'
]


for pdf_url in pdf_urls:
    response = urllib2.urlopen('http://' + pdf_url)
    output_name = pdf_url.rpartition('.')[2] + '.pdf'

    output_file = open(output_name, 'wb')
    output_file.write(output_name)

To be honest, I don't think I know what I'm doing...

1 Answer:

Answer 0: (score: 2)

You can use the requests library for Python:

import requests

session = requests.Session()

pdf_urls = [
    'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001',
    'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.pr001'
    # ...and the remaining URLs from the list
]


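for pdf_url in pdf_urls:
    # The session keeps the cookies the site sets between requests
    r = session.get('http://' + pdf_url)

    # e.g. '...bk-2012-1093.fw001' -> 'fw001.pdf'
    output_name = pdf_url.rpartition('.')[2] + '.pdf'
    with open(output_name, 'wb') as output_file:
        output_file.write(r.content)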

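This code saves the files in the current working directory, which is the directory of the Python script if you run it from there.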
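
Since the original attempt produced files that would not open, it may help to check that the downloaded bytes really are a PDF before saving them. The following is a minimal sketch in Python 3 syntax, assuming that a well-formed PDF normally begins with the bytes `%PDF-`. The helper names `looks_like_pdf` and `save_if_pdf` are hypothetical and not part of requests or of the code above:

```python
def looks_like_pdf(data):
    """Return True if the bytes start with the usual PDF header marker."""
    return data.startswith(b'%PDF-')


def save_if_pdf(data, output_name):
    """Write data to output_name only when it looks like a PDF.

    Returns True if the file was written, False otherwise.
    """
    if not looks_like_pdf(data):
        return False
    with open(output_name, 'wb') as output_file:
        output_file.write(data)
    return True
```

Inside the loop, `save_if_pdf(r.content, output_name)` could replace the direct write. A False result suggests the server sent something other than a PDF, for example an HTML page.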

Edit: code using urllib2

import urllib2
from cookielib import CookieJar

pdf_urls = [
'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001'
]

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

for pdf_url in pdf_urls:

    response = opener.open('http://' + pdf_url)
    content = response.read()

    output_name = pdf_url.rpartition('.')[2] + '.pdf'
    output_file = open(output_name, 'wb')
    output_file.write(content)
    output_file.close()  # flush the data and release the file handle

Some explanation:

First, http://pubs.acs.org/ requires the browser (or our Python script) to accept cookies. We can do this with a CookieJar:

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

In the loop, we iterate over the list of URLs and download each file:

response = opener.open('http://' + pdf_url)
content = response.read()

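content holds the bytes of a single PDF file. To save it, we first build the output file name. rpartition splits the string at the last occurrence of the separator and returns a three-element tuple:

  • the part of the string before the separator
  • the separator itself
  • the part of the string after the separator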
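
To make the three parts concrete, here is a small sketch in Python 3 syntax that unpacks the tuple for one of the URLs from the list. The variable names are only illustrative. It also shows what rpartition returns when the separator does not occur in the string:

```python
url = 'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001'

# Split at the LAST '.' in the string
before, separator, after = url.rpartition('.')
print(before)     # pubs.acs.org/doi/pdf/10.1021/bk-2012-1093
print(separator)  # .
print(after)      # fw001

# If the separator is missing, the whole string lands in the last element
print('nodot'.rpartition('.'))  # ('', '', 'nodot')
```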

So, for example,

'pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001'.rpartition('.')[2]

gives us fw001. We append the .pdf extension to form the file name, and then open the file for writing in binary mode:

output_file = open(output_name, 'wb')

and write the PDF data obtained from the site:

output_file.write(content)
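
The urllib2 and cookielib modules used above exist only in Python 2. As a rough sketch of the same approach in Python 3, assuming the standard-library modules urllib.request and http.cookiejar, the code might look like the following. The helper names `make_opener`, `output_name_for`, and `download_all` are hypothetical, and the download will only return PDFs from a network that has access to the site:

```python
import urllib.request
from http.cookiejar import CookieJar


def make_opener():
    """Build an opener that accepts cookies and sends them back."""
    cj = CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))


def output_name_for(pdf_url):
    """Turn '...bk-2012-1093.fw001' into 'fw001.pdf'."""
    return pdf_url.rpartition('.')[2] + '.pdf'


def download_all(pdf_urls, opener=None):
    """Download every URL in pdf_urls into the current working directory."""
    opener = opener or make_opener()
    for pdf_url in pdf_urls:
        with opener.open('http://' + pdf_url) as response:
            content = response.read()
        with open(output_name_for(pdf_url), 'wb') as output_file:
            output_file.write(content)


# Usage (performs real network requests, so it is left commented out):
# download_all(['pubs.acs.org/doi/pdf/10.1021/bk-2012-1093.fw001'])
```

Passing the opener in as a parameter keeps one cookie jar for the whole batch, so cookies set by the first response are reused for later requests.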