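I am trying to use pycurl to download a tgz file and extract it with tarfile, without storing the tgz file on disk and without holding the whole tgz file in memory. I want to download it and extract it as a stream, chunk by chunk.
I know how to set up a pycurl callback that hands me the data each time a new chunk is downloaded: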
def write(data):
    # Give data to tarfile to extract.
    ...

with contextlib.closing(pycurl.Curl()) as curl:
    curl.setopt(curl.URL, tar_uri)
    curl.setopt(curl.WRITEFUNCTION, write)
    curl.setopt(curl.FOLLOWLOCATION, True)
    curl.perform()
I also know how to open a tarfile in streaming mode:
output_tar = tarfile.open(mode='r|gz', fileobj=fileobj)
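But I don't know how to connect the two, so that each time I receive a chunk over the network, the next chunk of the tar file gets extracted.
Answer 0 (score: 0)
Honestly, unless you are really looking for a pure-Python solution (which is possible, just quite tedious), I would suggest simply using /usr/bin/tar and feeding it the data in chunks.
Something like: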
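import subprocess

# 'xzf -' tells tar to extract a gzip-compressed archive read from stdin.
p = subprocess.Popen(['/usr/bin/tar', 'xzf', '-', '-C', '/my/output/directory'], stdin=subprocess.PIPE)

def write(data):
    p.stdin.write(data)

with ...:
    curl.perform()

# Popen has no close(): close the pipe so tar sees end-of-file, then wait for it to exit.
p.stdin.close()
p.wait()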
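For comparison, the pure-Python route mentioned above could be sketched roughly as follows. The difficulty is that pycurl pushes data into a callback, while tarfile pulls data through read(). One way to bridge the two, assuming a background thread is acceptable, is an os.pipe: the callback writes into one end and tarfile reads from the other. The names start_stream_extractor and handle_member are made up for this illustration and are not part of pycurl or tarfile.

```python
import os
import tarfile
import threading


def start_stream_extractor(handle_member, compression='gz'):
    """Return (write, finish): write(data) feeds bytes, finish() signals end of input.

    handle_member(tar, tarinfo) is a hypothetical callback invoked for each entry.
    """
    read_fd, write_fd = os.pipe()
    errors = []

    def reader():
        # Runs in a background thread; tarfile pulls bytes from the pipe as needed.
        with os.fdopen(read_fd, 'rb') as source:
            try:
                with tarfile.open(fileobj=source, mode=f'r|{compression}') as tar:
                    for tarinfo in tar:
                        handle_member(tar, tarinfo)
            except Exception as exc:
                errors.append(exc)
            finally:
                # Keep reading until EOF so the writer never hits a closed pipe.
                while source.read(64 * 1024):
                    pass

    thread = threading.Thread(target=reader)
    thread.start()
    sink = os.fdopen(write_fd, 'wb')

    def write(data):
        # Blocks when the pipe is full, which throttles the download naturally.
        sink.write(data)

    def finish():
        sink.close()   # the reader sees EOF once the write end is closed
        thread.join()
        if errors:
            raise errors[0]

    return write, finish
```

With pycurl, write would be passed as the WRITEFUNCTION, and finish() would be called once curl.perform() returns. Only a pipe buffer's worth of compressed data is held in memory at any time.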
Answer 1 (score: 0)
A Python-only solution might look like this:
import contextlib
import tarfile
from http.client import HTTPSConnection


def https_download_tar(host, path, item_visitor, port=443, headers=None, compression='gz'):
    """Download and unpack tar file on-the-fly and call item_visitor for each entry.

    item_visitor will receive the arguments TarFile (the currently extracted stream)
    and the current TarInfo object
    """
    with contextlib.closing(HTTPSConnection(host=host, port=port)) as client:
        # Use a fresh dict per call instead of a shared mutable default argument.
        client.request('GET', path, headers=headers or {})
        with client.getresponse() as response:
            code = response.getcode()
            if code < 200 or code >= 300:
                raise Exception(f'HTTP error downloading tar: code: {code}')
            try:
                # 'r|gz' reads the response sequentially, without seeking.
                with tarfile.open(fileobj=response, mode=f'r|{compression}') as tar:
                    for tarinfo in tar:
                        item_visitor(tar, tarinfo)
            except Exception as e:
                raise Exception(f'Failed to extract tar stream: {e}') from e


# Test the download function using some popular archive
def list_entry(tar, tarinfo):
    print(f'{tarinfo.name}\t{"DIR" if tarinfo.isdir() else "FILE"}\t{tarinfo.size}\t{tarinfo.mtime}')


https_download_tar('dl.discordapp.net', '/apps/linux/0.0.15/discord-0.0.15.tar.gz', list_entry)
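HTTPSConnection is used to obtain a response stream (a file-like object), which is then passed to tarfile.open().
You can then iterate over the items in the TAR file and, for example, extract them with TarFile.extractfile().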
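As an illustration of that last step, here is a minimal sketch of an item_visitor that writes regular files to disk with TarFile.extractfile(). The helper name make_extracting_visitor and its dest_dir and chunk_size parameters are invented for this example. It only handles directories and regular files, and it skips any entry whose path would resolve outside the destination directory.

```python
import os
import shutil


def make_extracting_visitor(dest_dir, chunk_size=64 * 1024):
    """Build an item_visitor that writes regular files under dest_dir (hypothetical helper)."""
    dest_root = os.path.realpath(dest_dir)

    def visitor(tar, tarinfo):
        target = os.path.realpath(os.path.join(dest_root, tarinfo.name))
        # Skip entries that would land outside dest_dir (e.g. names containing '..').
        if os.path.commonpath([dest_root, target]) != dest_root:
            return
        if tarinfo.isdir():
            os.makedirs(target, exist_ok=True)
        elif tarinfo.isfile():
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # extractfile() returns a reader over the current member; in stream
            # mode it must be consumed before moving on to the next entry.
            with tar.extractfile(tarinfo) as member, open(target, 'wb') as out:
                shutil.copyfileobj(member, out, chunk_size)

    return visitor
```

It could be passed to the function above in place of list_entry, for example https_download_tar(host, path, make_extracting_visitor('/my/output/directory')). Each member is copied in chunk_size pieces, so no file is ever held in memory in full.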