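I have a file that I want to read which is itself zipped inside a zip archive. For example, parent.zip contains child.zip, which contains child.txt. I am having trouble reading child.zip. Can anyone correct my code?
I assume that I need to create child.zip as a file-like object and then open it with a second zipfile instance, but being new to Python, my zipfile.ZipFile(zfile.open(name)) is silly. It raises a zipfile.BadZipfile: "File is not a zip file" on child.zip, which I have independently validated.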
import logging
import re
import zipfile

with zipfile.ZipFile("parent.zip", "r") as zfile:
    for name in zfile.namelist():
        if re.search(r'\.zip$', name) is not None:
            # We have a zip within a zip
            # The next line is the call that raises BadZipfile
            with zipfile.ZipFile(zfile.open(name)) as zfile2:
                for name2 in zfile2.namelist():
                    # Now we can extract
                    logging.info("Found internal internal file: " + name2)
                    print("Processing code goes here")
Answer 0 (score: 41)
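When you call .open() on a ZipFile instance, you do get an open file handle. However, to read a zip file the ZipFile class needs a little more: it must be able to seek on that file, and the object returned by .open() is not seekable in your case. Only Python 3 (3.7 and later) produces a ZipExtFile object that supports seeking (provided the underlying file handle of the outer zip file is seekable and nothing is trying to write to the ZipFile object).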
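Whether the handle returned by .open() can be searched is something you can check at run time. The following is a minimal sketch, not part of the original answer: the helper name open_inner_zip is hypothetical, and it assumes the outer archive was opened for reading. It uses the entry directly when the handle reports itself as seekable and otherwise buffers the whole entry in memory.

```python
import io
import zipfile


def open_inner_zip(outer, name):
    """Open the archive stored as entry `name` inside the open ZipFile `outer`.

    Hypothetical helper: `outer` is assumed to be opened in read mode.
    """
    member = outer.open(name)
    if member.seekable():
        # The entry supports seeking, so ZipFile can read it in place.
        return zipfile.ZipFile(member)
    # Not seekable: copy the entry into an in-memory file instead.
    member.close()
    return zipfile.ZipFile(io.BytesIO(outer.read(name)))
```

Seeking backwards inside a compressed entry may force the data to be decompressed again from the start, so for large inner archives the in-memory copy can still be the faster route.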
The workaround is to read the entire zip entry into memory with .read(), store it in a BytesIO object (an in-memory file that is seekable), and feed that to ZipFile:
from io import BytesIO

# ...
zfiledata = BytesIO(zfile.read(name))
with zipfile.ZipFile(zfiledata) as zfile2:
    ...  # work with zfile2 here
Or, in the context of your example:
import logging
import re
import zipfile
from io import BytesIO

with zipfile.ZipFile("parent.zip", "r") as zfile:
    for name in zfile.namelist():
        if re.search(r'\.zip$', name) is not None:
            # We have a zip within a zip
            # Copy the entry into a seekable in-memory file first
            zfiledata = BytesIO(zfile.read(name))
            with zipfile.ZipFile(zfiledata) as zfile2:
                for name2 in zfile2.namelist():
                    # Now we can extract
                    logging.info("Found internal internal file: " + name2)
                    print("Processing code goes here")
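The same BytesIO approach extends to archives nested more than one level deep. Below is a rough sketch under that assumption; the generator name iter_nested_names and the "/"-joined path format are made up for illustration and are not part of the answer above. It treats any entry whose name ends in .zip as an inner archive.

```python
import io
import zipfile


def iter_nested_names(zfile, prefix=""):
    """Yield the path of every entry, descending into nested .zip entries.

    Hypothetical helper: paths of inner entries are joined with "/".
    """
    for name in zfile.namelist():
        path = prefix + name
        yield path
        if name.lower().endswith(".zip"):
            # Buffer the inner archive in memory so ZipFile can seek in it.
            data = io.BytesIO(zfile.read(name))
            with zipfile.ZipFile(data) as inner:
                yield from iter_nested_names(inner, path + "/")
```

Each level is held in memory while it is being walked, so this is only suitable when the inner archives are of modest size.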
Answer 1 (score: 9)
To get this working with Python 3.3 (under Windows, though that is probably irrelevant), I had to do this:
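import io
import re
import zipfile

file = "parent.zip"  # assumed placeholder: path to the outer archive

with zipfile.ZipFile(file, 'r') as zfile:
    for name in zfile.namelist():
        if re.search(r'\.zip$', name) is not None:
            # Read the inner archive into a seekable in-memory buffer
            zfiledata = io.BytesIO(zfile.read(name))
            with zipfile.ZipFile(zfiledata) as zfile2:
                for name2 in zfile2.namelist():
                    print(name2)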
cStringIO does not exist in Python 3, so I used io.BytesIO instead.
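To illustrate why io.BytesIO is the right replacement here rather than io.StringIO, the following small sketch (my own illustration, assuming Python 3) builds a throwaway archive in memory and shows that its bytes can be wrapped by BytesIO but are rejected by StringIO. The variable names are arbitrary.

```python
import io
import zipfile

# Build a tiny archive in memory to get some real zip bytes.
scratch = io.BytesIO()
with zipfile.ZipFile(scratch, "w") as zf:
    zf.writestr("child.txt", "hello")
payload = scratch.getvalue()  # bytes, the same type ZipFile.read() returns

# BytesIO accepts bytes and supports seeking, so ZipFile can open it.
with zipfile.ZipFile(io.BytesIO(payload)) as zf:
    names = zf.namelist()

# StringIO only accepts text on Python 3, so it cannot hold the archive.
try:
    io.StringIO(payload)
    stringio_accepts_bytes = True
except TypeError:
    stringio_accepts_bytes = False
```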
Answer 2 (score: 0)
Here is a function I came up with. (Copied from here.)
import os
import zipfile
from io import BytesIO


def extract_nested_zipfile(path, parent_zip=None):
    """Returns a ZipFile specified by path, even if the path contains
    intermediary ZipFiles.  For example, /root/gparent.zip/parent.zip/child.zip
    will return a ZipFile that represents child.zip
    """

    def extract_inner_zipfile(parent_zip, child_zip_path):
        """Returns a ZipFile specified by child_zip_path that exists inside
        parent_zip.
        """
        # ZipFile needs a seekable file, so copy the entry into memory
        memory_zip = BytesIO()
        memory_zip.write(parent_zip.read(child_zip_path))
        return zipfile.ZipFile(memory_zip)

    if ('.zip' + os.sep) in path:
        (parent_zip_path, child_zip_path) = os.path.relpath(path).split(
            '.zip' + os.sep, 1)
        parent_zip_path += '.zip'

        if not parent_zip:
            # This is the top-level, so read from disk
            parent_zip = zipfile.ZipFile(parent_zip_path)
        else:
            # We're already in a zip, so pull it out and recurse
            parent_zip = extract_inner_zipfile(parent_zip, parent_zip_path)

        return extract_nested_zipfile(child_zip_path, parent_zip)
    else:
        if parent_zip:
            return extract_inner_zipfile(parent_zip, path)
        else:
            # If there is no nesting, it's easy!
            return zipfile.ZipFile(path)
Here is how I tested it:
echo hello world > hi.txt
zip wrap1.zip hi.txt
zip wrap2.zip wrap1.zip
zip wrap3.zip wrap2.zip
print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap1.zip').open('hi.txt').read())
print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap2.zip/wrap1.zip').open('hi.txt').read())
print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap3.zip/wrap2.zip/wrap1.zip').open('hi.txt').read())
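For a test that does not depend on the zip command line tool or on fixed paths, the same fixture can be built from Python. This is only a sketch: build_fixture and open_nested are hypothetical helpers written for this illustration, and open_nested takes the inner archive names as a list instead of parsing them out of a path string as the function above does.

```python
import io
import os
import tempfile
import zipfile


def open_nested(outer_path, inner_names):
    """Open outer_path, then each archive named in inner_names in turn."""
    current = zipfile.ZipFile(outer_path)
    for name in inner_names:
        # Copy the inner archive into memory so it is seekable.
        data = io.BytesIO(current.read(name))
        current.close()
        current = zipfile.ZipFile(data)
    return current


def build_fixture(directory):
    """Create wrap3.zip > wrap2.zip > wrap1.zip > hi.txt; return wrap3's path."""
    payload = b"hello world\n"
    name = "hi.txt"
    for level in (1, 2, 3):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr(name, payload)
        # The archive just built becomes the payload of the next level.
        payload, name = buf.getvalue(), "wrap%d.zip" % level
    path = os.path.join(directory, name)
    with open(path, "wb") as fh:
        fh.write(payload)
    return path


with tempfile.TemporaryDirectory() as tmp:
    wrap3 = build_fixture(tmp)
    with open_nested(wrap3, ["wrap2.zip", "wrap1.zip"]) as innermost:
        content = innermost.read("hi.txt")
print(content)
```

Passing the inner names as a list sidesteps the question of which path separator to split on, since entry names inside a zip archive always use "/" regardless of os.sep.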