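I'm working on converting my backup script from shell to Python. One feature of my old script was to check the integrity of the created tarfile by running gzip -t.
This seems to be a bit tricky in Python.
It looks like the only way to do it is to read every compressed TarInfo object inside the tarfile.
Is there a way to check the integrity of a tarfile without extracting it to disk or holding it (in its entirety) in memory?
The good people in #python on freenode suggested that I read each TarInfo object block by block, discarding each block as it is read.
I must admit that I have no idea how to do this, since I have only just started with Python.
Imagine I have a 30GB tarfile that contains files ranging from 1kb to 10GB...
This is the solution I started writing: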
try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

for member_info in tardude.getmembers():
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()
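This code is far from finished. I'm afraid to run it on a huge 30GB tar archive, because at some point check would be a 10+ GB object (if the archive contains files that large).
Addendum: I tried manually corrupting zero.tar.gz (hex editor, changed a few bytes mid-file). The first except does not catch the IOError... Here is the output: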
Traceback (most recent call last):
  File "./test.py", line 31, in <module>
    for member_info in tardude.getmembers():
  File "/usr/lib/python2.7/tarfile.py", line 1805, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib/python2.7/tarfile.py", line 2380, in _load
    tarinfo = self.next()
  File "/usr/lib/python2.7/tarfile.py", line 2315, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib/python2.7/gzip.py", line 429, in seek
    self.read(1024)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 320, in _read
    self._read_eof()
  File "/usr/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xe5384b87 != 0xdfe91e1L
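Answer 0 (score: 2)
"I tried manually corrupting zero.tar.gz (hex editor, changed a few bytes mid-file). The first except does not catch the IOError..."
If you look at the traceback, you'll see that the exception is raised when you call tardude.getmembers(), so you'll need something like...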
try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

try:
    members = tardude.getmembers()
except:
    print "There was an error reading tarfile members."

for member_info in members:
    try:
        check = tardude.extractfile(member_info.name)
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()
As for the original question, you're almost there. You just need to read the data from the check object, with something like...
import tarfile

BLOCK_SIZE = 1024

try:
    tardude = tarfile.open("zero.tar.gz")
except:
    print "There was an error opening tarfile. The file might be corrupt or missing."

try:
    members = tardude.getmembers()
except:
    print "There was an error reading tarfile members."

for member_info in members:
    try:
        check = tardude.extractfile(member_info.name)
        # Read the member one block at a time and discard each block.
        while 1:
            data = check.read(BLOCK_SIZE)
            if not data:
                break
    except:
        print "File: %r is corrupt." % member_info.name

tardude.close()
...which should ensure that you never hold more than BLOCK_SIZE bytes of file data in memory at a time.
Also, you should try to avoid using...
try:
    do_something()
except:
    do_something_else()
...because it will mask unexpected exceptions. Try to catch only the exceptions you actually intend to handle, for example...
try:
    do_something()
except IOError:
    do_something_else()
...otherwise you'll find it much harder to detect bugs in your code.
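Putting both points together (read in blocks, catch only the exceptions you expect), a minimal sketch for Python 3 might look like the following. The name verify_tar_gz, the 64 KiB block size and the exact tuple of exception types are assumptions made for illustration, not something from the code above. It opens the gzip layer explicitly so that, after the last member, the rest of the stream can be drained, which is what makes the gzip module reach the trailer and compare the CRC.

```python
import gzip
import tarfile
import zlib

BLOCK_SIZE = 64 * 1024  # assumed value; any modest block size works

# Exceptions a damaged or missing archive is expected to raise (an assumption;
# which one you get depends on where the damage is).
CORRUPTION_ERRORS = (tarfile.TarError, OSError, EOFError, zlib.error)


def verify_tar_gz(path, block_size=BLOCK_SIZE):
    """Return None if the archive reads cleanly, else a short error message.

    Hypothetical helper: every regular file is read block by block and the
    data is thrown away, so nothing is extracted to disk.
    """
    current = path  # what was being read when an error occurred
    try:
        with gzip.open(path, "rb") as gz:
            # "r|" reads the uncompressed tar data as a forward-only stream.
            with tarfile.open(fileobj=gz, mode="r|") as archive:
                for member in archive:
                    current = member.name
                    if not member.isfile():
                        continue  # directories, links etc. have no data to read
                    source = archive.extractfile(member)
                    while source.read(block_size):
                        pass  # discard the block
            # Drain what follows the last member so the gzip trailer
            # (CRC and length) is checked as well.
            current = path
            while gz.read(block_size):
                pass
    except CORRUPTION_ERRORS as exc:
        return "%s: %s" % (current, exc)
    return None
```

Calling verify_tar_gz("zero.tar.gz") would then give None for a clean archive and a message otherwise. Only the first error is reported because a gzip stream cannot be reliably read past the point where it is damaged.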
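Answer 1 (score: 2)
This is only a minor improvement on the answer above, to make things a bit more idiomatic (although I've removed some of the error checking to make the mechanics more visible):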
import tarfile

BLOCK_SIZE = 1024

with tarfile.open("zero.tar.gz") as tardude:
    for member in tardude.getmembers():
        with tardude.extractfile(member.name) as target:
            for chunk in iter(lambda: target.read(BLOCK_SIZE), b''):
                pass
This really just removes the while 1: (sometimes considered a minor code smell) and the if not data: check. Also note that the use of with restricts this to Python 2.7+.
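In case the two-argument form of iter() is unfamiliar, here is a small stand-alone illustration. The io.BytesIO object is only a stand-in for what extractfile() returns, and the tiny block size is chosen just to make the chunks visible.

```python
import io

BLOCK_SIZE = 4  # deliberately tiny, for illustration only

# Stand-in for the file-like object returned by extractfile().
source = io.BytesIO(b"0123456789")

# iter(callable, sentinel) calls the callable again and again and stops as
# soon as it returns the sentinel, here the empty bytes that read() gives at
# end of file.
chunks = list(iter(lambda: source.read(BLOCK_SIZE), b""))

print(chunks)  # [b'0123', b'4567', b'89']
```

In the integrity check the chunks are not collected at all; the loop body is just pass, so each block is dropped as soon as it has been read.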
Answer 2 (score: 1)
You can use the subprocess module to call gzip -t on the file...
from subprocess import call
import os

with open(os.devnull, 'w') as bb:
    result = call(['gzip', '-t', "zero.tar.gz"], stdout=bb, stderr=bb)
If result is not 0, something is amiss. You might want to check that gzip is available, though. I've written a utility function for that:
import subprocess
import sys
import os


def checkfor(args, rv=0):
    """Make sure that a program necessary for using this script is
    available.

    Arguments:
    args  -- string or list of strings of commands. A single string may
             not contain spaces.
    rv    -- expected return value from invoking the command.
    """
    if isinstance(args, str):
        if ' ' in args:
            raise ValueError('no spaces in single command allowed')
        args = [args]
    try:
        with open(os.devnull, 'w') as bb:
            rc = subprocess.call(args, stdout=bb, stderr=bb)
        if rc != rv:
            raise OSError
    except OSError as oops:
        outs = "Required program '{}' not found: {}."
        print(outs.format(args[0], oops.strerror))
        sys.exit(1)
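On Python 3.3 or later, the availability check and the gzip -t call can be combined with shutil.which, and the gzip module can serve as a fallback when no gzip binary is installed. The sketch below is one possible arrangement, not part of the function above. The names gzip_ok and gzip_ok_python and the 64 KiB block size are made up for illustration, and the fallback treats OSError, EOFError and zlib.error as signs of corruption, which is an assumption about how damage usually surfaces.

```python
import gzip
import shutil
import subprocess
import zlib


def gzip_ok_python(path, block_size=64 * 1024):
    """Pure-Python check: decompress the whole stream and discard the data."""
    try:
        with gzip.open(path, "rb") as stream:
            while stream.read(block_size):
                pass  # reading to the end makes gzip verify the CRC trailer
    except (OSError, EOFError, zlib.error):
        return False
    return True


def gzip_ok(path):
    """Use the external gzip -t if a gzip binary is on PATH, else fall back."""
    gzip_exe = shutil.which("gzip")
    if gzip_exe is None:
        return gzip_ok_python(path)
    result = subprocess.run(
        [gzip_exe, "-t", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # As in the answer above, any non-zero exit status counts as a failure.
    return result.returncode == 0
```

Like gzip -t, both variants only test the compression layer. They say nothing about whether the tar data inside is well formed.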