Need to download the PDF, not the content of the web page

Time: 2017-12-04 21:52:10

Tags: python python-2.7 proxy python-requests

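So, I can get the content of the web page for the PDF link EXAMPLE OF THE LINK HERE, but I don't want the content of the web page. I want the content of the PDF, so that I can save it as a PDF file in a folder on my computer.

I have successfully done this on websites where I don't need to log in and where there is no proxy server.

Relevant code: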

import os
import urllib2
import time
import requests
import urllib3
from random import *


s = requests.Session()
data = {"Username":"username", "Password":"password"}
url = "https://login.url.com"

print "doing things"
r2 = s.post(url, data=data, proxies = {'https' : 'https://PROXYip:PORT'}, verify=False)

#I get a response 200 from printing r2
print r2


download_url = "http://msds.walmartstores.com/client/document?productid=1000527&productguid=54e8aa24-0db4-4973-a81f-87368312069a&DocumentKey=undefined&HazdocumentKey=undefined&MSDS=0&subformat=NAM"

# maxCounter is not defined in this excerpt; presumably it is set elsewhere in the full script
file = open(r"F:\my_filepath\document" + str(maxCounter) + ".pdf", 'wb')
temp = s.get(download_url, proxies = {'https' : 'https://PROXYip:PORT'}, verify=False)

#This prints out the response from the proxy server (i.e. 200)
print temp

something = uniform(5,6)
print something
time.sleep(something)

#This gets me the content of the web page, not the content of the PDF
print temp.content

file.write(temp.content)
file.close()

I need help figuring out how to "download" the content of the PDF.

1 Answer:

Answer 0: (score: 2)

Try this:

import requests

url = 'http://msds.walmartstores.com/client/document?productid=1000527&productguid=54e8aa24-0db4-4973-a81f-87368312069a&DocumentKey=undefined&HazdocumentKey=undefined&MSDS=0&subformat=NAM'

pdf = requests.get(url)
with open('walmart.pdf', 'wb') as file:
    file.write(pdf.content)
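
If the server answers with a login or error page instead of the document, the snippet above will still write that HTML into walmart.pdf. As a rough sketch (not part of the original answer), the body can be checked for the `%PDF-` marker that PDF files start with before saving. The helper names `looks_like_pdf` and `save_pdf` are made up for illustration.

```python
def looks_like_pdf(content):
    """Return True if the raw bytes begin with the PDF marker '%PDF-'."""
    # Leading whitespace is stripped first; an HTML page would start
    # with something like '<!DOCTYPE' or '<html' instead.
    return content.lstrip()[:5] == b'%PDF-'


def save_pdf(content, path):
    """Write content to path only if it looks like a PDF.

    Returns True if the file was written, False otherwise.
    """
    if not looks_like_pdf(content):
        return False
    with open(path, 'wb') as handle:
        handle.write(content)
    return True
```

With the response from above this would be called as `save_pdf(pdf.content, 'walmart.pdf')`; a return value of False suggests the server sent a web page rather than the PDF.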

Edit

Try again using a requests session to manage the cookies (assuming the site sends you some after login), and possibly a different proxy.

proxy_dict = {'https': 'ip:port'}

with requests.Session() as session:
    # Authentication request, use GET/POST whatever is needed
    # login_url and data (user/password information) come from the question's code
    auth = session.get(login_url, data=data, proxies=proxy_dict, verify=False)
    if auth.status_code == 200:
        print(auth.cookies) # Tell me if you got anything
        # We're continuing the same session, so the login cookies are sent along
        pdf = session.get(download_url, proxies=proxy_dict, verify=False)
        with open('walmart.pdf', 'wb') as file:
            file.write(pdf.content)
    else:
        print('No go, got {0} response'.format(auth.status_code))
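
One more thing that may be worth checking: requests picks the proxy by the scheme of the URL being requested, so a dict with only an 'https' key is not applied to the http:// download URL. Whether that matters on this network is an assumption on my part. A minimal sketch that maps both schemes to the same proxy follows; `build_proxies` is a hypothetical helper and the address is a placeholder.

```python
def build_proxies(proxy_url):
    """Return a proxies dict that routes both http and https URLs
    through the same proxy.

    proxy_url is the full proxy address including its scheme,
    e.g. 'http://PROXYip:PORT' (placeholder, not a real address).
    """
    return {'http': proxy_url, 'https': proxy_url}


# Placeholder address; replace with the real proxy before use
proxy_dict = build_proxies('http://PROXYip:PORT')
```

The resulting `proxy_dict` can be passed as `proxies=proxy_dict` to both the login request and the download request.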