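I have one way of downloading content from a controlled server: passing a document ID to a link like this:
https://website/deployLink/442/document/download/$NUMBER
If I navigate to this page in a browser, the file with ID $NUMBER is downloaded.
The problem is that I have 9,000 files on my server, which is SSL-encrypted and normally requires logging in with a username/password through a dialog pop-up that appears on the web page.
I have already posted a similar thread in which I downloaded the files with WGET. Now I would like to try Python, supplying the username/password and going through the SSL encryption.
Here is my attempt to grab a single file, which results in a 401 error. The full stack trace follows the code.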
import urllib2
import ctypes
from HTMLParser import HTMLParser
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
top_level_url = "https://website.com/home.html"
password_mgr.add_password(None, top_level_url, "admin", "password")
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)
# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)
# Grab website
response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
html = response.read()
class MyHTMLParser(HTMLParser):
    # url is a class attribute, so it is referenced through the class below
    url = 'https://website/deployLink/442/document/download/1'

# Save the file
webpage = urllib2.urlopen(MyHTMLParser.url)
with open('Test.doc', 'wb') as localFile:
    localFile.write(webpage.read())
What am I doing wrong here? Is what I am attempting even possible?
C:\Python27\python.exe C:/Users/ADMIN/PycharmProjects/GetFile.py
Traceback (most recent call last):
  File "C:/Users/ADMIN/PycharmProjects/GetFile.py", line 22, in <module>
    response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 437, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 401: Processed
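Process finished with exit code 1
Here is my real page, with some information hidden:
The authentication URL ends with :443.
Can someone help me debug this and get it running?
Thanks.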
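Answer 0 (score: 1)
Assuming the code you posted above is accurate, I think your problem is the URI you pass to the add_password method. When setting the username/password you have this: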
# Add the username and password.
top_level_url = "https://website.com/home.html"
password_mgr.add_password(None, top_level_url, "admin", "password")
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
Then your subsequent request goes to this URI:
# Grab website
response = urllib2.urlopen('https://website/deployLink/442/document/download/1')
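(I am going to assume the hosts were "scrubbed" inconsistently and are supposed to be the same, and move on. See: "website" vs. "website.com")
The second URI is not a child of the first URI, going by their respective path portions. The URI path /deployLink/442/document/download/1 is not a child of /home.html. From the library's perspective, the second URI has no auth data.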
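The lookup can be checked without touching the network. Below is a minimal sketch, assuming Python 3, where urllib2's password manager lives in urllib.request (the question uses Python 2.7, whose urllib2 should give the same results for these URLs). The helper name lookup and the host website.com are only for illustration; "admin"/"password" are the placeholder credentials from the question.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

# URL from the question, with the host assumed to be website.com
DOWNLOAD_URL = "https://website.com/deployLink/442/document/download/1"


def lookup(registered_uri, request_url=DOWNLOAD_URL):
    """Return the (user, password) pair the manager offers for request_url
    when credentials were registered under registered_uri."""
    mgr = HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, registered_uri, "admin", "password")
    return mgr.find_user_password(None, request_url)


# Registered under /home.html: the download path is not below it, so no match.
page_scoped = lookup("https://website.com/home.html")

# Registered under the site root: every path on that host is below it.
root_scoped = lookup("https://website.com/")

# Same root registration, but the request goes to a different host name.
host_mismatch = lookup("https://website.com/",
                       "https://website/deployLink/442/document/download/1")

print(page_scoped)    # (None, None)
print(root_scoped)    # ('admin', 'password')
print(host_mismatch)  # (None, None)
```

If this matches what you see, registering the credentials under the site root (or under /deployLink/442/document/download/) with the exact host you request from should let HTTPBasicAuthHandler answer the 401 challenge, provided the server really uses HTTP Basic authentication.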