Python: Saving a large web page to a file

Asked: 2011-11-22 00:56:01

Tags: python file urllib2

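Let me start by saying that I'm not new to programming, but I am new to Python.

I wrote a program using urllib2 that requests a web page I want to save to a file. The page is about 300 KB, which doesn't strike me as particularly large, but it seems to be enough to cause me trouble, so I'm calling it 'large'. I'm using a simple call to copy directly from the object returned by urlopen into a file: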

file.write(webpage.read())

But it just sits there for several minutes trying to write to the file, and I eventually get the following:

Traceback (most recent call last):
  File "program.py", line 51, in <module>
    main()
  File "program.py", line 43, in main
    f.write(webpage.read())
  File "/usr/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 592, in _read_chunked
    value.append(self._safe_read(amt))
  File "/usr/lib/python2.7/httplib.py", line 649, in _safe_read
    raise IncompleteRead(''.join(s), amt)
httplib.IncompleteRead: IncompleteRead(6384 bytes read, 1808 more expected)

I don't know why this is causing the program so much grief.


EDIT:

Here is how I retrieve the page:

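import cookielib
import urllib
import urllib2

# LOGIN_PAGE, WEBPAGE, USERNAME and PASSWORD are assumed to be defined elsewhere.

# Store cookies from the login response so later requests reuse them.
jar = cookielib.CookieJar()
cookie_processor = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie_processor)
urllib2.install_opener(opener)

requ_login = urllib2.Request(
    LOGIN_PAGE,
    data=urllib.urlencode({'destination': "", 'username': USERNAME, 'password': PASSWORD}))
requ_page = urllib2.Request(WEBPAGE)

try:
    # log in
    urllib2.urlopen(requ_login)

    # get desired page
    portfolio = urllib2.urlopen(requ_page)
except urllib2.HTTPError as e:
    # HTTPError subclasses URLError, so it must be caught first; only it has .code
    print e.code, ": ", e.msg
except urllib2.URLError as e:
    print e.reason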

1 Answer:

Answer 0 (score: 5)

I use the handy file-object copier function provided by the shutil module. It worked on my machine :)

>>> import urllib2
>>> import shutil
>>> remote_fo = urllib2.urlopen('http://docs.python.org/library/shutil.html')
>>> with open('bigfile', 'wb') as local_fo:
...     shutil.copyfileobj(remote_fo, local_fo)
... 
>>> 

UPDATE: You may want to pass a third argument to copyfileobj; it controls the size of the internal buffer used to transfer bytes.
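To illustrate the third argument, here is a small sketch written for Python 3 (the copyfileobj call itself has the same shape in Python 2). CountingReader is a hypothetical in-memory stand-in for the object returned by urlopen, and the 64 KB buffer size is an arbitrary value chosen for illustration, not a recommendation.

```python
import io
import shutil


class CountingReader(io.BytesIO):
    """Hypothetical stand-in for a response object; records each requested read size."""

    def __init__(self, data):
        super().__init__(data)
        self.read_sizes = []

    def read(self, size=-1):
        self.read_sizes.append(size)
        return super().read(size)


# About 300 KB of dummy data, similar in size to the page in the question.
source = CountingReader(b"x" * 300000)
target = io.BytesIO()

# The third argument is the chunk size, in bytes, requested on each read.
shutil.copyfileobj(source, target, 64 * 1024)

print(len(target.getvalue()), "bytes copied")
print("read sizes requested:", sorted(set(source.read_sizes)))
```

Every read asks for at most the given number of bytes, so the whole page is never held in memory at once.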

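UPDATE2: There is nothing fancy about shutil.copyfileobj. It simply reads a chunk of bytes from the source file object and writes it to the destination file object, repeating until there is nothing left to read. Here is its actual source code, taken from the Python standard library: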

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)
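
Because the loop is this simple, it is easy to adapt. The following is only a sketch, assuming Python 3 (where httplib is named http.client): a variant that keeps the bytes received before an IncompleteRead, like the one in the question's traceback, instead of discarding them. copy_keep_partial and TruncatedSource are hypothetical names made up for this illustration. It does not fix a server that ends the response early; it only preserves what did arrive.

```python
import http.client
import io


def copy_keep_partial(fsrc, fdst, length=16 * 1024):
    """Copy fsrc to fdst in chunks; return (bytes_written, completed).

    Hypothetical helper: the same loop as shutil.copyfileobj, except that
    data received before an IncompleteRead is written out rather than lost.
    """
    written = 0
    while True:
        try:
            buf = fsrc.read(length)
        except http.client.IncompleteRead as exc:
            # exc.partial holds the bytes that arrived before the read failed.
            fdst.write(exc.partial)
            return written + len(exc.partial), False
        if not buf:
            return written, True
        fdst.write(buf)
        written += len(buf)


class TruncatedSource(object):
    """Illustration only: a source whose second read fails part-way through."""

    def __init__(self):
        self.calls = 0

    def read(self, length):
        self.calls += 1
        if self.calls == 1:
            return b"a" * 10
        # 4 bytes arrived, 6 more were expected.
        raise http.client.IncompleteRead(b"b" * 4, 6)


out = io.BytesIO()
result = copy_keep_partial(TruncatedSource(), out)
print(result)
```

The second element of the returned tuple tells the caller whether the copy finished, so the program can decide whether to retry the request.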