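Python requests does not download PDF correctly

Time: 2019-10-10 03:45:32

Tags: python python-3.x

I'm using Python 3 on Windows 10. The script downloads a PDF that cannot be opened: the file is 136 KB instead of the expected 721 KB.

I tried three different ways of writing the PDF to a file (see #1#, #2#, and #3# in the code).

I wonder whether the problem is authentication. I'm new to authentication, but as far as I can tell the site uses POST.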

import requests

downloadurl = "https://pedsinreview.aappublications.org/content/pedsinreview/40/10/e35.full.pdf"

username = 'myusername'
password = 'mypassword'
chunk_size = 1024

payload = {'name': username, 'pass': password}
r = requests.get(downloadurl, data=payload, verify=False, stream=True)

#r.raw.decode_content = True
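with open("file_name.pdf", 'wb') as f:
    # Three approaches were tried, one at a time. #3 is left active here
    # so that the with-block has a body and the script runs.
    #1# f.write(r.content)
    #2# shutil.copyfileobj(r.raw, f)  # needs "import shutil"
    #3#
    for chunk in r.iter_content(chunk_size):
        if chunk:
            f.write(chunk)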

The expected output is a 721 KB PDF that I can open, but what I get is a 136 KB file that cannot be read.

Thanks in advance for your help.


Update:

It works!

import requests

loginurl = "https://pedsinreview.aappublications.org/user/login"
downloadurl = "https://pedsinreview.aappublications.org/content/pedsinreview/40/10/e35.full.pdf"

username = 'myusername'
password = 'mypassword'
chunk_size = 1024

#r = requests.get(downloadurl, data=payload, verify=False, stream=True)

# Do everything with the context of the session
with requests.Session() as session:
    data = {
        'form_id': 'user_login',
        'name': username,
        'pass': password
    }
    login_request = session.post(loginurl, data=data)
    print(login_request.status_code) # returns 200, I think it should be 302 because 
    #that's what it shows when I login successfully in browser vs. 200 when I use a 
    #wrong password.

    # Now you are logged in and should be able to request the pdf
    r = session.get(downloadurl)

with open("file_name.pdf", 'wb') as f:
    for chunk in r.iter_content(chunk_size):
        if chunk:
            f.write(chunk)

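1 answer:

Answer 0 (score: 0)

You are right that this is an authentication problem. Because you are not logged in, the server redirects you to an HTML page, and that page is what you are downloading.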
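One way to confirm this is to look at the first bytes of the file that was saved. The sketch below is a minimal check, assuming the 136 KB file is an HTML page rather than a damaged PDF. `sniff_download` is a hypothetical helper name used for illustration; it is not part of requests.

```python
def sniff_download(path, probe_size=1024):
    """Guess whether a saved download is a PDF or an HTML page.

    Hypothetical helper for illustration: it only inspects the first
    probe_size bytes, so treat the result as a hint, not a guarantee.
    """
    with open(path, "rb") as f:
        head = f.read(probe_size).lstrip()

    # PDF files begin with the marker "%PDF-" followed by a version number.
    if head.startswith(b"%PDF-"):
        return "pdf"

    # A login or error page usually starts with a doctype or an <html> tag.
    lowered = head.lower()
    if lowered.startswith(b"<!doctype html") or lowered.startswith(b"<html"):
        return "html"

    return "unknown"
```

If `sniff_download("file_name.pdf")` returns `"html"`, the server sent a web page (most likely the login page) instead of the PDF.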

So, as a first step, you will need to do something like this:

import requests

downloadurl = "https://pedsinreview.aappublications.org/content/pedsinreview/40/10/e35.full.pdf"

# Do everything within the context of the session
with requests.Session() as session:
    # Not sure if the last few are required, but I went to the site and looked at 
    # the login request and this is everything that was included.
    data = {
        'name': 'myusername',
        'pass': 'mypassword',
        'form_id': 'highwire_user_login',
        'form_build_id': 'form-yXL7wQkB-M6s7VkeYYQMBr0lPt8ICKc1ZFB5Qc-bOJ4',
        'op': 'Log in'
    }
    login_request = session.post("https://pedsinreview.aappublications.org/content/40/10/e35", data=data)
    print(login_request.status_code) # should be 200
    # Now you are logged in and should be able to request the pdf.
    # Use the session (not requests.get) so the login cookies are sent.
    r = session.get(downloadurl, verify=False, stream=True)
    ...
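
Putting the pieces together, the download step could look like the sketch below. It assumes `session` is a `requests.Session` that has already logged in. `download_pdf` is a hypothetical helper name, and the `%PDF-` check is an added safeguard for illustration, not something the site requires.

```python
def download_pdf(session, url, path, chunk_size=1024):
    """Stream url to path through a logged-in session; return bytes written.

    Hypothetical helper: raises ValueError if the body does not start
    with the PDF marker, which usually means a login page came back.
    """
    with session.get(url, stream=True) as r:
        r.raise_for_status()  # fail early on 4xx/5xx responses
        chunks = r.iter_content(chunk_size)

        # Peek at the first chunk before creating the output file.
        first = next(chunks, b"")
        if not first.startswith(b"%PDF-"):
            raise ValueError("response is not a PDF (not logged in?)")

        written = 0
        with open(path, "wb") as f:
            f.write(first)
            written += len(first)
            for chunk in chunks:
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)
                    written += len(chunk)
    return written
```

It would be called as `download_pdf(session, downloadurl, "file_name.pdf")` inside the `with requests.Session()` block. Also note that requests follows redirects by default, so `login_request.status_code` can be 200 even when the server first answered 302; the intermediate responses are available in `login_request.history`.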