I need to fetch a .tar.gz archive from an HTTP server and compute the MD5 sum of every file it contains. Since the archive is 4.5 GB compressed (12 GB uncompressed), I'd like to avoid touching the hard drive. Of course, I can't keep everything in RAM either.
I tried to use Python, but my problem is that, for some strange reason, the tarfile module tries to seek() to the end of the input file handle, which you can't do with a piped stream. Any ideas?
import tarfile
import hashlib
import subprocess

URL = 'http://myhost/myfile.tar.gz'
url_fh = subprocess.Popen('curl %s | gzip -cd' % URL, shell=True, stdout=subprocess.PIPE)
tar_fh = tarfile.open(mode='r', fileobj=url_fh.stdout)
for tar_info in tar_fh:
    content_fh = tar_fh.extractfile(tar_info)
    print hashlib.md5(content_fh.read()).hexdigest(), tar_info.name
tar_fh.close()
The above fails with:
Traceback (most recent call last):
  File "gzip_pipe.py", line 13, in <module>
    tar_fh = tarfile.open(mode='r', fileobj=url_fh.stdout)
  File "/algo/algos2dev4/AlgoOne-EC/third-party-apps/python/lib/python2.6/tarfile.py", line 1644, in open
    saved_pos = fileobj.tell()
IOError: [Errno 29] Illegal seek
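The root cause is the mode: `mode='r'` lets tarfile seek around the archive (hence the `tell()` call), whereas the streaming modes (`'r|'`, `'r|gz'`, `'r|*'`) read strictly forward and never seek. A minimal, self-contained sketch of the difference, using a made-up in-memory archive and a pipe-like wrapper that only supports `read()`:

```python
import io
import tarfile

# Build a tiny .tar.gz in memory to stand in for the remote archive.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    data = b'hello world\n'
    info = tarfile.TarInfo(name='hello.txt')
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
payload = buf.getvalue()

class PipeLike(object):
    """File-like object supporting only read(), like a subprocess pipe."""
    def __init__(self, data):
        self._stream = io.BytesIO(data)
    def read(self, n=-1):
        return self._stream.read(n)

# mode='r' would call tell()/seek() on the file object and fail here;
# the streaming mode 'r|gz' reads strictly forward and succeeds.
with tarfile.open(fileobj=PipeLike(payload), mode='r|gz') as tar:
    names = [member.name for member in tar]
print(names)
```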
Answer (score: 3)
To compute the md5 sums of all files in a remote archive on the fly:
#!/usr/bin/env python
import tarfile
import sys
import hashlib
from contextlib import closing
from functools import partial

try:
    from urllib.request import urlopen
except ImportError:  # Python 2
    from urllib2 import urlopen

def md5sum(file, bufsize=1 << 15):
    d = hashlib.md5()
    for buf in iter(partial(file.read, bufsize), b''):
        d.update(buf)
    return d.hexdigest()

url = sys.argv[1]  # url to download
with closing(urlopen(url)) as r, tarfile.open(fileobj=r, mode='r|*') as archive:
    for member in archive:
        if member.isreg():  # extract only regular files from the archive
            with closing(archive.extractfile(member)) as file:
                print("{name}\t{sum}".format(name=member.name, sum=md5sum(file)))
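The same loop can be exercised without the network by pointing it at an in-memory archive instead of `urlopen(url)`; a hedged sketch (the file names and contents are made up for illustration):

```python
import hashlib
import io
import tarfile
from functools import partial

def md5sum(file, bufsize=1 << 15):
    # Read in fixed-size chunks so memory use stays bounded.
    d = hashlib.md5()
    for buf in iter(partial(file.read, bufsize), b''):
        d.update(buf)
    return d.hexdigest()

# Build a throwaway .tar.gz in memory so the loop can run offline.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    for name, data in [('a.txt', b'foo'), ('b.txt', b'bar')]:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Same structure as the answer: streaming mode, regular files only.
results = {}
with tarfile.open(fileobj=buf, mode='r|*') as archive:
    for member in archive:
        if member.isreg():
            with archive.extractfile(member) as f:
                results[member.name] = md5sum(f)

print(results)
```

Note that in streaming mode each member must be read while it is the current one; you cannot go back to an earlier member once the iterator has moved past it.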