Question

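I'm trying to use pycurl to download a tgz file and extract it with tarfile, without storing the tgz file on disk and without holding the whole tgz file in memory. I'd like to download it and extract it chunk by chunk, as a stream.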

I know how to set up a pycurl callback that hands me the data each time a new chunk is downloaded:

def write(data):
    # Give data to tarfile to extract.
    ...

with contextlib.closing(pycurl.Curl()) as curl:
    curl.setopt(curl.URL, tar_uri)
    curl.setopt(curl.WRITEFUNCTION, write)
    curl.setopt(curl.FOLLOWLOCATION, True)
    curl.perform()

I also know how to open a tarfile in stream mode:

output_tar = tarfile.open(mode='r|gz', fileobj=fileobj)

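But I don't know how to connect these two things, so that each time I receive a chunk over the network, the next chunk of the tar file gets extracted.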

Answer 1

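Honestly, unless you're really looking for a pure-Python solution (which is possible, but rather tedious), I'd suggest simply running /usr/bin/tar and feeding it the data in chunks.

Something like: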

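import contextlib
import subprocess
import pycurl

# Have tar read the gzipped archive from stdin ("-f -").
p = subprocess.Popen(['/usr/bin/tar', '-xzf', '-', '-C', '/my/output/directory'], stdin=subprocess.PIPE)

def write(data):
    p.stdin.write(data)  # forward each downloaded chunk to tar

with contextlib.closing(pycurl.Curl()) as curl:
    curl.setopt(curl.URL, tar_uri)
    curl.setopt(curl.WRITEFUNCTION, write)
    curl.setopt(curl.FOLLOWLOCATION, True)
    curl.perform()

p.stdin.close()  # signal end of input
p.wait()         # wait for tar to finish extracting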
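
For comparison, the pure-Python route mentioned above can be sketched with an OS pipe and a worker thread: the download callback writes into one end of the pipe while tarfile reads the other end in stream mode. This is only a sketch under assumptions. The names `untar_from_callback`, `download` and `item_visitor` are hypothetical; `download` stands for any function that fetches the archive and calls the `write` function it is given once per chunk.

```python
import os
import tarfile
import threading


def untar_from_callback(download, item_visitor, compression='gz'):
    """Run download(write) while a worker thread reads the same bytes as a tar stream.

    download:     hypothetical callable; it must call write(chunk) for every chunk.
    item_visitor: hypothetical callable; called as item_visitor(tar, tarinfo) per entry.
    """
    read_fd, write_fd = os.pipe()
    errors = []

    def worker():
        with os.fdopen(read_fd, 'rb') as reader:
            try:
                # 'r|gz' reads strictly forward, so a pipe is good enough.
                with tarfile.open(fileobj=reader, mode=f'r|{compression}') as tar:
                    for tarinfo in tar:
                        item_visitor(tar, tarinfo)
            except Exception as exc:
                errors.append(exc)
            finally:
                # Drain leftover bytes so the writer never blocks on a full pipe.
                while reader.read(65536):
                    pass

    thread = threading.Thread(target=worker)
    thread.start()
    try:
        # Closing the writer signals end-of-file to the worker.
        with os.fdopen(write_fd, 'wb') as writer:
            download(writer.write)
    finally:
        thread.join()
    if errors:
        raise errors[0]
```

With pycurl, `download` could be a small function that sets `curl.WRITEFUNCTION` to the `write` argument it receives and then calls `curl.perform()`. Only one pipe buffer of data is held in memory at a time, at the cost of an extra thread.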

Answer 2

A Python-only solution could look like this:

import contextlib
import tarfile
from http.client import HTTPSConnection


def https_download_tar(host, path, item_visitor, port=443, headers=None, compression='gz'):
    """Download and unpack a tar file on the fly, calling item_visitor for each entry.

    item_visitor receives the TarFile (the stream currently being extracted)
    and the current TarInfo object.
    """
    with contextlib.closing(HTTPSConnection(host=host, port=port)) as client:
        # headers defaults to None to avoid a mutable default argument.
        client.request('GET', path, headers=headers or {})
        with client.getresponse() as response:
            code = response.status
            if code < 200 or code >= 300:
                raise Exception(f'HTTP error downloading tar: code: {code}')
            try:
                # The response is file-like, so tarfile can read it as a stream.
                with tarfile.open(fileobj=response, mode=f'r|{compression}') as tar:
                    for tarinfo in tar:
                        item_visitor(tar, tarinfo)
            except Exception as e:
                # Chain the original exception so its traceback is preserved.
                raise Exception(f'Failed to extract tar stream: {e}') from e

# Test the download function using some popular archive
def list_entry(tar, tarinfo):
    print(f'{tarinfo.name}\t{"DIR" if tarinfo.isdir() else "FILE"}\t{tarinfo.size}\t{tarinfo.mtime}')

https_download_tar('dl.discordapp.net', '/apps/linux/0.0.15/discord-0.0.15.tar.gz', list_entry)

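HTTPSConnection provides the response as a file-like stream, which is then passed to tarfile.open().

You can then iterate over the entries in the TAR file and, for example, extract them with TarFile.extractfile().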
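
As an illustration of that last step, here is a minimal sketch of a visitor that writes regular files and directories to disk. The helper name `make_extractor` and the destination directory are assumptions, not part of the answer's code, and other entry types (symlinks, devices and so on) are deliberately skipped.

```python
import os
import shutil
import tarfile


def make_extractor(dest_dir):
    """Return a visitor(tar, tarinfo) that extracts files and directories under dest_dir."""
    root = os.path.realpath(dest_dir)

    def visit(tar, tarinfo):
        target = os.path.realpath(os.path.join(root, tarinfo.name))
        # Refuse entries that would land outside dest_dir ('../x', absolute paths).
        if os.path.commonpath([root, target]) != root:
            raise ValueError(f'Unsafe path in archive: {tarinfo.name}')
        if tarinfo.isdir():
            os.makedirs(target, exist_ok=True)
        elif tarinfo.isfile():
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # extractfile() returns a reader over this member's bytes in the stream.
            with tar.extractfile(tarinfo) as src, open(target, 'wb') as dst:
                shutil.copyfileobj(src, dst)
        # Any other entry type is ignored in this sketch.

    return visit
```

It could be passed as the third argument, for example `https_download_tar(host, path, make_extractor('/my/output/directory'))`. Because the archive is opened in stream mode, each member has to be read while it is the current entry; seeking back to an earlier member is not possible.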
