Question

I am using the mechanize library in Python to download large files. I use mechanize because I need to retrieve data from a form.

The problem with downloading many files at once in Python is that my system memory (RAM) runs out very quickly.

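One way I can think of to reduce memory usage is to download a file in parts and save each part to disk. But the server I am downloading from uses HTTP/1.0. When I add a Range header to the download request, Range: bytes=0-8192, the server returns the file starting from the 8192nd byte.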

Is there something wrong with the header I am adding, or is partial content download not possible with HTTP/1.0?

Is there any other way to reduce the memory usage of the download script?

Here is the Python code that downloads the file:

import cookielib  # Python 2 standard library
import mechanize

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

webpage = <url>
br.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20100101 Firefox/16.0"), ("Accept","text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),("Accept-Language","en-US,en;q=0.5"),("Accept-Encoding","gzip, deflate"),("DNT","1")]
br.open(webpage)

br.select_form(name='receive')

fl_nm = "test.pdf"

br.addheaders = [("Range", "bytes=0-8192")]
response = br.submit() # submits the form, just like if you clicked the submit button
fileObj = open(direc+'/'+fl_nm,"wb") # open for binary write; direc is the target directory, defined elsewhere
fileObj.write(response.read()) # reads the whole response into memory at once
fileObj.close()

Answer 1

The response behaves like a file handle, so you can read it chunk by chunk:

response = br.open('...')

with open('output.ext', 'wb') as handle:
    for chunk in iter((lambda: response.read(4096)), b''):  # b'' marks end of stream (equals '' on Python 2)
        handle.write(chunk)

So instead of reading the whole file into memory and then writing it back out, you read it 4096 bytes at a time.
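
As a rough sketch of the same idea in a reusable form, the loop can be wrapped in a small helper. The name save_stream and the in-memory stand-in for the response are assumptions made for illustration; neither is part of mechanize. Any object with a read(size) method that returns bytes should work the same way.

```python
import io
import os
import tempfile


def save_stream(response, path, chunk_size=4096):
    """Copy a file-like object to disk one chunk at a time.

    Returns the number of bytes written. save_stream is a hypothetical
    helper name, not a mechanize function.
    """
    total = 0
    with open(path, 'wb') as handle:
        # read() returns b'' at end of stream, which stops the iterator
        for chunk in iter(lambda: response.read(chunk_size), b''):
            handle.write(chunk)
            total += len(chunk)
    return total


# Stand-in for a real response: 10000 bytes in an in-memory stream.
fake_response = io.BytesIO(b'x' * 10000)
target = os.path.join(tempfile.mkdtemp(), 'output.ext')
written = save_stream(fake_response, target)
```

At any moment only one chunk of at most chunk_size bytes is held in memory, so peak usage does not grow with the size of the file.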

Answer 2

Try something like:

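import urllib  # Python 2; urlopen lives in urllib.request on Python 3

def output_page(file_name, url, chunk=1024):
    page = urllib.urlopen(url) # open webpage
    with open(file_name,'wb') as f: # open file; closed automatically on exit
        s = page.read(chunk) # read the first chunk
        while s: # once the page is read, s == ''
            f.write(s) # write data
            s = page.read(chunk) # and read the next chunk
    page.close() # release the connection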
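
The standard library's shutil.copyfileobj runs essentially the same read/write loop, so it may be a shorter alternative. The sketch below uses in-memory streams as stand-ins for the opened page and the output file; those stand-ins are assumptions for illustration only.

```python
import io
import shutil

source = io.BytesIO(b'abc' * 1000)  # stand-in for the opened page or response
sink = io.BytesIO()                 # stand-in for the open output file

# Copies in reads of up to 1024 bytes until source.read() returns nothing.
shutil.copyfileobj(source, sink, 1024)
copied = sink.getvalue()
```

With a real download, source would be the response object and sink a file opened in 'wb' mode.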

Downloading files in Python over HTTP/1.0
