Question

I have one way of downloading content from a controlled server: passing a document ID into a link like this:

https://website/deployLink/442/document/download/$NUMBER

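If I navigate to this page in a browser, the file with ID $NUMBER is downloaded.

The problem is that there are 9,000 files on my server, which is SSL-encrypted and normally requires signing in with a username/password through a dialog box that pops up on the web page.

I already posted a similar thread, where I downloaded the files with WGET. Now I'd like to try Python, supplying the username/password and getting through the SSL encryption.

Here is my attempt at grabbing one file, which results in a 401 error. The full stack trace follows the code.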

import urllib2
import ctypes
from HTMLParser import HTMLParser

# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
top_level_url = "https://website.com/home.html"
password_mgr.add_password(None, top_level_url, "admin", "password")
handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)

# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)

# Grab website
response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
html = response.read()

class MyHTMLParser(HTMLParser):
    pass  # placeholder: no parsing logic was given

url = 'https://website/deployLink/442/document/download/1'


# Save the file
webpage = urllib2.urlopen(url)
with open('Test.doc','wb') as localFile:
     localFile.write(webpage.read())

What am I doing wrong here? Is what I'm attempting even possible?

C:\Python27\python.exe C:/Users/ADMIN/PycharmProjects/GetFile.py
Traceback (most recent call last):
  File "C:/Users/ADMIN/PycharmProjects/GetFile.py", line 22, in <module>
    response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 437, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 401: Processed

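Process finished with exit code 1

Here is my real page, with some information hidden:

The authentication URL ends with :443.

Can someone help me debug this and get it running?

Thanks.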

Answer 1

Assuming the code above is accurate, I think your problem is the URI passed to the add_password method. When you set the username/password, you have this:

# Add the username and password.
top_level_url = "https://website.com/home.html"
password_mgr.add_password(None, top_level_url, "admin", "password")
handler = urllib2.HTTPBasicAuthHandler(password_mgr)

Your subsequent request then goes to this URI:

# Grab website
response = urllib2.urlopen('https://website/deployLink/442/document/download/1')

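(I'm assuming the hostnames were "scrubbed" inconsistently and should be the same, and moving on. See: "website" vs. "website.com")

The second URI is not a child of the first URI, judging by their respective path components. The URI path /deployLink/442/document/download/1 is not a child of /home.html. The password manager only returns credentials for URIs at or below the one that was registered, so from the library's perspective the second URI has no auth data.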
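The lookup behaviour can be checked without touching the network. The following is a minimal sketch, not a confirmed fix: it assumes the real hostname is website.com in both places, that the server uses HTTP Basic authentication, and that the credentials are the placeholder "admin"/"password" from the question. The helper names credentials_for and make_opener are hypothetical, introduced only for illustration. The import fallback is there so the same calls run on Python 3, where urllib2 became urllib.request.

```python
try:
    import urllib2  # Python 2, as used in the question
except ImportError:
    import urllib.request as urllib2  # Python 3 home of the same classes

# Placeholder URL, assuming the host is really website.com throughout.
DOWNLOAD_URL = "https://website.com/deployLink/442/document/download/1"


def credentials_for(registered_uri, request_url):
    """Hypothetical helper: register one URI, then ask the password
    manager which credentials it would hand out for request_url."""
    mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, registered_uri, "admin", "password")
    return mgr.find_user_password(None, request_url)


def make_opener(base_uri, user, password):
    """Hypothetical helper: build an opener whose Basic auth handler
    covers every path below base_uri."""
    mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, base_uri, user, password)
    return urllib2.build_opener(urllib2.HTTPBasicAuthHandler(mgr))


# Registered under /home.html: the download path is not below it.
home_lookup = credentials_for("https://website.com/home.html", DOWNLOAD_URL)

# Registered under the site root: every path on the host is covered.
root_lookup = credentials_for("https://website.com/", DOWNLOAD_URL)

# A different hostname ("website" vs. "website.com") never matches.
other_host_lookup = credentials_for(
    "https://website.com/",
    "https://website/deployLink/442/document/download/1",
)

print(home_lookup)
print(root_lookup)
print(other_host_lookup)

# Building the opener does not contact the server; a later call such as
# opener.open(DOWNLOAD_URL) would.
opener = make_opener("https://website.com/", "admin", "password")
```

Only the lookup registered at the site root returns the username and password; the other two return a pair of None values. Registering the credentials at the root (or at any prefix of the download path) is therefore the change to try first. If the 401 persists after that, the server may be using an authentication scheme other than Basic, which this sketch does not cover.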

Running GET with SSL and authentication in Python
