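I want to read a CSV (text) file line by line, in Python 2.7, from a 7z-compressed archive. I don't want to decompress the entire (large) file; I want to stream the lines instead.
I tried pylzma.decompressobj() and it failed with a data error. Note that this code does not read line by line yet: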
    import pylzma

    input_filename = r"testing.csv.7z"
    with open(input_filename, 'rb') as infile:
        obj = pylzma.decompressobj()
        o = open('decompressed.raw', 'wb')
        while True:
            tmp = infile.read(1)
            if not tmp:
                break
            o.write(obj.decompress(tmp))
        o.close()
Output:
        o.write(obj.decompress(tmp))
    ValueError: data error during decompression
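A possible explanation for the error, offered as an assumption rather than something confirmed in this thread: a .7z file is an archive container that starts with its own signature and headers, while the loop above feeds the file to the decompressor from byte 0 as if it were a bare LZMA stream. The sketch below only checks for the 7z signature bytes; `looks_like_7z` is a hypothetical helper name and is not part of pylzma.

```python
# Signature at the start of a 7z archive: '7', 'z', 0xBC, 0xAF, 0x27, 0x1C.
SEVENZ_MAGIC = b"7z\xbc\xaf\x27\x1c"


def looks_like_7z(path):
    """Return True if the file at `path` starts with the 7z signature.

    Hypothetical helper for illustration only; it does not validate
    the rest of the archive.
    """
    with open(path, "rb") as f:
        return f.read(len(SEVENZ_MAGIC)) == SEVENZ_MAGIC
```

If this returns True for testing.csv.7z, the file is a 7z container and probably needs an archive-aware reader rather than a raw stream decompressor.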
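Answer 0 (score: 7)
This will let you iterate over the lines. It is partly derived from code I found in an answer to another question.
As far as I know, py7zlib currently does not provide an API for reading an archive member as a stream of bytes or characters. Its ArchiveFile class only provides a read() function, which decompresses the member and returns all of its uncompressed data at once. Given that, about the best you can do is use that result as a buffer and return bytes or lines from it iteratively. The code below does this, but it won't help much if the problem is that the archive member itself is very large.
I have modified the code below so that it works in both Python 2.7 and 3.x.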
    import io
    import py7zlib


    class SevenZFileError(py7zlib.ArchiveError):
        pass


    class SevenZFile(object):
        @classmethod
        def is_7zfile(cls, filepath):
            """ Determine if filepath points to a valid 7z archive. """
            is7z = False
            fp = None
            try:
                fp = open(filepath, 'rb')
                archive = py7zlib.Archive7z(fp)
                _ = len(archive.getnames())
                is7z = True
            except (py7zlib.ArchiveError, IOError):
                # Unreadable file or not a 7z archive: report False
                # instead of letting the exception propagate.
                pass
            finally:
                if fp:
                    fp.close()
            return is7z

        def __init__(self, filepath):
            # The file object has to stay open while the archive is in use.
            fp = open(filepath, 'rb')
            self.filepath = filepath
            self.archive = py7zlib.Archive7z(fp)

        def __contains__(self, name):
            return name in self.archive.getnames()

        def readlines(self, name):
            """ Iterator of lines from an archive member. """
            if name not in self:
                raise SevenZFileError('archive member %r not found in %r' %
                                      (name, self.filepath))
            # read() decompresses the whole member into memory; the lines
            # are then served one at a time from that in-memory buffer.
            data = self.archive.getmember(name).read().decode()
            for line in io.StringIO(data):
                yield line
Sample usage:
    import csv

    if SevenZFile.is_7zfile('testing.csv.7z'):
        sevenZfile = SevenZFile('testing.csv.7z')

        if 'testing.csv' not in sevenZfile:
            print('testing.csv is not a member of testing.csv.7z')
        else:
            reader = csv.reader(sevenZfile.readlines('testing.csv'))
            for row in reader:
                print(', '.join(row))
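For comparison, here is a minimal sketch of what the line-splitting half of a truly streaming reader could look like, assuming some source that yields decompressed byte chunks. As noted above, py7zlib does not provide such a source, so `iter_lines` and its `chunks` argument are hypothetical and only illustrate the buffering technique.

```python
def iter_lines(chunks):
    """Yield complete lines (as bytes, newline kept) from an iterable of
    byte chunks, holding back only the unfinished tail between chunks."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        pieces = pending.split(b"\n")
        # The last piece has no terminating newline yet; keep it for later.
        pending = pieces.pop()
        for piece in pieces:
            yield piece + b"\n"
    if pending:
        # Final line of the input had no trailing newline.
        yield pending
```

With this approach only one chunk plus one partial line is held in memory at a time, which is the property the question is asking for.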
Answer 1 (score: 2)
Answer 2 (score: -1)
If you can use Python 3, there is a useful library, py7zr, which supports partial 7zip decompression, as shown below:
    import py7zr
    import re

    filter_pattern = re.compile(r'<your/target/file_and_directories/regex/expression>')
    with py7zr.SevenZipFile('archive.7z', 'r') as archive:
        allfiles = archive.getnames()
        # Keep only the member names that match the pattern.
        selective_files = [f for f in allfiles if filter_pattern.match(f)]
        archive.extract(targets=selective_files)
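One detail worth keeping in mind with the filter above: `re.match` only matches at the start of the string, so the pattern has to account for any leading directories in the member name. The sketch below shows this with made-up member names; `select_members` is a hypothetical helper, not part of py7zr.

```python
import re


def select_members(names, pattern):
    """Return the names that `pattern` matches from the start of the string."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


# Made-up member names, for illustration only.
names = ["data/a.csv", "data/b.txt", "other/data/c.csv"]
only_top_level = select_members(names, r"data/.*\.csv$")
any_csv = select_members(names, r".*\.csv$")
```

Here `only_top_level` contains just "data/a.csv", while `any_csv` also picks up "other/data/c.csv". The resulting list is what would be passed as `targets`.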