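I am writing a Python script that uses the urllib2 module as an equivalent of the command-line utility wget. The only feature I need is that it can retrieve an arbitrary file given a URL and save it to a named file. I only have to worry about two command-line arguments: the URL to download from and the name of the file to save the contents to.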
Example:
python Prog7.py www.python.org pythonHomePage.html
Here is my code:
import urllib
import urllib2
#import requests
url = 'http://www.python.org/pythonHomePage.html'
print "downloading with urllib"
urllib.urlretrieve(url, "code.txt")
print "downloading with urllib2"
f = urllib2.urlopen(url)
data = f.read()
with open("code2.txt", "wb") as code:
    code.write(data)
urllib seems to work, but urllib2 does not.
The error I receive:
  File "Problem7.py", line 11, in <module>
    f = urllib2.urlopen(url)
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 616, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: NOT FOUND
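Answer 0 (score: 1)
The URL simply does not exist; https://www.python.org/pythonHomePage.html really is a 404 Not Found page.
The difference between urllib and urllib2 is that the latter automatically raises an exception when a 404 page is returned, whereas urllib.urlretrieve() just saves the error page for you: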
>>> import urllib
>>> urllib.urlopen('https://www.python.org/pythonHomePage.html').getcode()
404
>>> import urllib2
>>> urllib2.urlopen('https://www.python.org/pythonHomePage.html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: NOT FOUND
If you want to save the error page, you can catch the urllib2.HTTPError exception:
try:
    f = urllib2.urlopen(url)
    data = f.read()
except urllib2.HTTPError as err:
    data = err.read()
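The same pattern can be extended to the two-argument command-line tool described in the question. This is only a minimal sketch, assuming Python 3, where the urllib2 functionality lives in urllib.request and urllib.error. The names normalize_url, fetch and main, and the opener parameter, are illustrative and do not come from the original post.

```python
import sys
import urllib.error
import urllib.request


def normalize_url(url):
    """Prepend http:// when the URL has no scheme (e.g. 'www.python.org')."""
    if "://" not in url:
        return "http://" + url
    return url


def fetch(url, opener=urllib.request.urlopen):
    """Return (status, body); on an HTTP error, return the error page body."""
    try:
        response = opener(url)
        return response.getcode(), response.read()
    except urllib.error.HTTPError as err:
        # HTTPError is also a file-like response, so its body can be read.
        return err.code, err.read()


def main(argv):
    """Expect argv to look like ['Prog7.py', URL, FILENAME]."""
    if len(argv) != 3:
        sys.stderr.write("usage: python Prog7.py URL FILENAME\n")
        return 2
    url, filename = normalize_url(argv[1]), argv[2]
    status, body = fetch(url)
    with open(filename, "wb") as out:
        out.write(body)
    # Non-zero exit status when the server reported an error.
    return 0 if status < 400 else 1

# To run as a script, call: sys.exit(main(sys.argv))
```

Passing the opener as a parameter is a design choice made here so the download logic can be exercised without network access. Whether the error page should be saved at all, as this sketch does, depends on what the tool is meant to do.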
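Answer 1 (score: 0)
This is caused by the different behavior of urllib and urllib2. Because the page returns a 404 error (page not found), urllib2 raises an exception for it, whereas urllib downloads the HTML of the returned page regardless of the error. If you want to write the HTML to a text file, you can read it from the error object: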
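import urllib2

try:
    data = urllib2.urlopen('http://www.python.org/pythonHomePage.html').read()
except urllib2.HTTPError as e:
    print e.code
    print e.msg
    print e.headers
    # Read the error body only once; a second read() would return an empty string.
    data = e.fp.read()
    print data

# Write whatever was received, whether the normal page or the error page.
with open("code2.txt", "wb") as code:
    code.write(data)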
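req will be a Request object, fp will be a file-like object with the HTTP error body, code will be the three-digit code of the error, msg will be the user-visible explanation of the code and hdrs will be a mapping object with the headers of the error.
More information about HTTP errors: urllib2 documentation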
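To make those fields concrete, here is a small sketch, again assuming Python 3 (urllib.error rather than urllib2). It builds an HTTPError by hand, so no network access is needed. The helper name describe_http_error and the sample header and body are made up for illustration.

```python
import io
import urllib.error


def describe_http_error(err):
    """Collect the fields of an HTTPError, reading the body exactly once."""
    body = err.read()
    return {
        "code": err.code,              # three-digit status code
        "msg": err.msg,                # user-visible explanation
        "headers": dict(err.headers),  # header mapping
        "body": body,                  # error page content
    }


# Hand-built error (hypothetical header and body) standing in for a real 404.
example = urllib.error.HTTPError(
    "http://www.python.org/pythonHomePage.html",
    404,
    "NOT FOUND",
    {"Content-Type": "text/html"},
    io.BytesIO(b"<html>not found</html>"),
)

info = describe_http_error(example)
print(info["code"], info["msg"])
# The body stream is now exhausted, so reading again yields empty bytes.
print(example.read())
```

Because the body is a stream, it is worth storing it in a variable on the first read if it has to be both printed and written to a file.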