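Using Python 3 on Windows 10. The script below downloads a PDF that cannot be opened: the file is 136 KB instead of the expected 721 KB.
I tried three different ways of writing the PDF to a file (see #1#, #2# and #3# in the code).
I wonder whether the problem is authentication. I am new to authentication, but as far as I can tell the site uses POST.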
import shutil  # only needed for variant #2#
import requests

downloadurl = "https://pedsinreview.aappublications.org/content/pedsinreview/40/10/e35.full.pdf"
username = 'myusername'
password = 'mypassword'
chunk_size = 1024
payload = {'name': username, 'pass': password}
# The credentials are sent as the body of a plain GET request here
r = requests.get(downloadurl, data=payload, verify=False, stream=True)
#r.raw.decode_content = True
with open("file_name.pdf", 'wb') as f:
    # The three variants were tried one at a time; #3# is left active here
    #1# f.write(r.content)
    #2# shutil.copyfileobj(r.raw, f)
    #3#
    for chunk in r.iter_content(chunk_size):
        if chunk:
            f.write(chunk)
Expected output: a 721 KB PDF that I can open. Actual output: a 136 KB file that cannot be read.
Thanks in advance for your help.
Update:
It works!
import requests

loginurl = "https://pedsinreview.aappublications.org/user/login"
downloadurl = "https://pedsinreview.aappublications.org/content/pedsinreview/40/10/e35.full.pdf"
username = 'myusername'
password = 'mypassword'
chunk_size = 1024
#r = requests.get(downloadurl, data=payload, verify=False, stream=True)
# Do everything with the context of the session
with requests.Session() as session:
    data = {
        'form_id': 'user_login',
        'name': username,
        'pass': password
    }
    login_request = session.post(loginurl, data=data)
    print(login_request.status_code)  # returns 200, I think it should be 302 because
    # that's what it shows when I login successfully in browser vs. 200 when I use a
    # wrong password.
    # Now you are logged in and should be able to request the pdf
    r = session.get(downloadurl)
    with open("file_name.pdf", 'wb') as f:
        for chunk in r.iter_content(chunk_size):
            if chunk:
                f.write(chunk)
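A possible explanation for seeing 200 rather than 302: requests follows redirects by default, so `status_code` describes the final response, while any intermediate redirects are kept in `response.history`. The sketch below lists every hop of a response. The stand-in objects only mimic the `status_code`, `url` and `history` attributes of a real response; the 302 is taken from what the browser showed, and the final URL is a made-up placeholder, not something verified against the site.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return [(status_code, url), ...] for each redirect hop plus the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-ins for a login that answers 302 and then lands on a 200 page.
# The landing URL is hypothetical.
first_hop = SimpleNamespace(
    status_code=302,
    url="https://pedsinreview.aappublications.org/user/login",
    history=[],
)
final = SimpleNamespace(
    status_code=200,
    url="https://pedsinreview.aappublications.org/",
    history=[first_hop],
)

chain = redirect_chain(final)
print(chain)
```

With a real session, `redirect_chain(login_request)` would show whether a 302 happened along the way. Passing `allow_redirects=False` to `session.post` is another way to see the first status code directly.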
Answer 0 (score: 0)
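You are right that it is an authentication problem. Because you are not logged in, the server redirects you to an HTML page, and that HTML is what you are actually downloading.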
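One way to confirm this is to look at the first bytes of the saved file: a PDF starts with the signature `%PDF-`, while a login page typically starts with an HTML doctype or `<html>` tag. The helper below is a minimal sketch; the name `sniff_payload` and the sample byte strings are made up for illustration.

```python
def sniff_payload(data: bytes) -> str:
    """Guess whether downloaded bytes are a PDF, an HTML page, or something else."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    # HTML pages may begin with whitespace; the tag case can vary.
    head = data.lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


# Illustrative inputs, not real responses from the site.
print(sniff_payload(b"%PDF-1.4\n%..."))
print(sniff_payload(b"\n<!DOCTYPE html>\n<html><body>Log in</body></html>"))
```

For the file from the question this could be called as `sniff_payload(open("file_name.pdf", "rb").read(1024))`. Checking `r.headers.get("Content-Type")` on the response is a complementary hint.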
So the first thing you need to do is log in, along these lines:
# Do everything with the context of the session
with requests.Session() as session:
    # Not sure if the last few are required, but I went to the site and looked at
    # the login request and this is everything that was included.
    data = {
        'name': 'myusername',
        'pass': 'mypassword',
        'form_id': 'highwire_user_login',
        'form_build_id': 'form-yXL7wQkB-M6s7VkeYYQMBr0lPt8ICKc1ZFB5Qc-bOJ4',
        'op': 'Log in'
    }
    login_request = session.post("https://pedsinreview.aappublications.org/content/40/10/e35", data=data)
    print(login_request.status_code)  # should be 200
    # Now you are logged in and should be able to request the pdf.
    # Use the session (not requests.get) so the login cookies are sent along.
    r = session.get(downloadurl, verify=False, stream=True)
...
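The `form_build_id` above was copied from one page load, and it may differ on the next one. If the server turns out to check it, one option is to read the hidden form fields from the login page just before posting. The sketch below uses only the standard library; the class and function names are made up, and the sample HTML is a guess at what such a form could look like, not the site's actual markup.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html: str) -> dict:
    """Return all hidden input fields found in the given HTML text."""
    parser = HiddenInputCollector()
    parser.feed(html)
    return parser.fields


# Hypothetical login form, for illustration only.
sample = """
<form id="highwire-user-login" method="post">
  <input type="text" name="name">
  <input type="password" name="pass">
  <input type="hidden" name="form_build_id" value="form-EXAMPLE">
  <input type="hidden" name="form_id" value="highwire_user_login">
  <input type="submit" name="op" value="Log in">
</form>
"""

fields = hidden_fields(sample)
print(fields)
```

Inside the session, the payload could then be built as `data = {**hidden_fields(session.get(page_url).text), 'name': username, 'pass': password, 'op': 'Log in'}`, where `page_url` is whichever page shows the login form. Note that this collects hidden inputs from every form on the page, so a page with several forms would need extra filtering.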