urllib2: retrieve an arbitrary file from a URL and save it to a named file

Asked: 2014-10-13 09:16:12

Tags: python urllib2

I am writing a python script that uses the urllib2 module as an equivalent of the command-line utility wget. The only feature I want is the ability to retrieve an arbitrary file from a URL and save it to a named file. I also only need to worry about two command-line arguments: the URL to download the file from, and the name of the file to save the contents to.

Example:

python Prog7.py www.python.org pythonHomePage.html
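For reference, a minimal sketch of the wget-style tool described above. The `download()` helper name is illustrative, not part of either library, and the fallback import lets the same file run under Python 2 (`urllib2`) or Python 3 (`urllib.request`). Note that `urlopen` requires the URL to carry a scheme, so the bare `www.python.org` in the example invocation would need to be `http://www.python.org`:

```python
# Sketch of the wget-like tool: fetch a URL, save the bytes to a named file.
import sys

try:
    from urllib2 import urlopen            # Python 2
except ImportError:
    from urllib.request import urlopen     # Python 3

def download(url, filename):
    """Save the resource at `url` to `filename`; return the byte count."""
    # The URL must include a scheme ("http://..."); a bare "www.python.org"
    # makes urlopen raise "ValueError: unknown url type".
    response = urlopen(url)
    data = response.read()
    with open(filename, "wb") as out:
        out.write(data)
    return len(data)

# Only run as a script when exactly two arguments were supplied.
if __name__ == "__main__" and len(sys.argv) == 3:
    download(sys.argv[1], sys.argv[2])
```

With this sketch the invocation would be `python Prog7.py http://www.python.org/ pythonHomePage.html`.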

Here is my code:

import urllib
import urllib2
#import requests

url = 'http://www.python.org/pythonHomePage.html'

print "downloading with urllib"
urllib.urlretrieve(url, "code.txt")

print "downloading with urllib2"
f = urllib2.urlopen(url)
data = f.read()
with open("code2.txt", "wb") as code:
    code.write(data)

urllib seems to work, but urllib2 does not.

The error received:

 File "Problem7.py", line 11, in <module>
    f = urllib2.urlopen(url)
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 616, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: NOT FOUND

2 answers:

Answer 0 (score: 1)

The URL simply does not exist; https://www.python.org/pythonHomePage.html really is a 404 Not Found page.

The difference between urllib and urllib2 here is that the latter automatically raises an exception when a 404 page is returned, while urllib.urlretrieve() just saves the error page for you:

>>> import urllib
>>> urllib.urlopen('https://www.python.org/pythonHomePage.html').getcode()
404
>>> import urllib2
>>> urllib2.urlopen('https://www.python.org/pythonHomePage.html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: NOT FOUND

If you want to save the error page anyway, you can catch the urllib2.HTTPError exception:

try:
    f = urllib2.urlopen(url)
    data = f.read()
except urllib2.HTTPError as err:
    data = err.read()
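Wrapped as a function, that pattern returns the body either way; `fetch_even_on_error` is an illustrative name, not an API of either library, and the compatibility imports are only there so the sketch also runs on Python 3:

```python
try:
    from urllib2 import urlopen, HTTPError      # Python 2
except ImportError:
    from urllib.request import urlopen          # Python 3
    from urllib.error import HTTPError

def fetch_even_on_error(url):
    # Return the response body whether the request succeeded or the
    # server answered with an HTTP error page (404, 500, ...).
    try:
        return urlopen(url).read()
    except HTTPError as err:
        return err.read()   # the error body the server sent with the 404/500
```

Called on https://www.python.org/pythonHomePage.html, this would return the 404 page's HTML instead of raising.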

Answer 1 (score: 0)

This is caused by the differing behavior of urllib and urllib2. Because the page returns a 404 error (page not found), urllib2 "catches" it, whereas urllib downloads the HTML of the returned page regardless of the error. If you want to write that HTML to a text file, you can print the error:

import urllib2
try:
    data = urllib2.urlopen('http://www.python.org/pythonHomePage.html').read()
except urllib2.HTTPError as e:
    print e.code
    print e.msg
    print e.headers
    body = e.fp.read()  # read the body only once; the file object is exhausted after a read()
    print body
    with open("code2.txt", "wb") as code:
        code.write(body)

req will be a Request object, fp will be a file-like object with the HTTP error body, code will be the three-digit code of the error, msg will be the user-visible explanation of the code, and hdrs will be a mapping object with the headers of the error.
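Those attributes can be checked directly on a caught exception. The sketch below builds an HTTPError by hand (so it runs without any network access; the URL and body are made up) and reads the same fields, with a fallback import so it also works outside Python 2:

```python
import io

try:
    from urllib2 import HTTPError          # Python 2
except ImportError:
    from urllib.error import HTTPError     # Python 3

# Construct an HTTPError with the attributes described above:
# url, code, msg, hdrs, and fp (a file-like object with the error body).
err = HTTPError("http://www.python.org/pythonHomePage.html",   # url
                404,                                           # code
                "NOT FOUND",                                   # msg
                {},                                            # hdrs
                io.BytesIO(b"<html>error page</html>"))        # fp

assert err.code == 404
assert err.msg == "NOT FOUND"
body = err.read()   # HTTPError is itself file-like; read() yields the body
assert body == b"<html>error page</html>"
```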

More information on HTTP errors: urllib2 documentation