Question

I want to read a 7z-compressed CSV (text) file line by line in Python 2.7. I don't want to decompress the whole (large) file; I want to stream the lines instead.

I tried pylzma.decompressobj() without success: I get a data error. Note that this code does not yet read line by line:

input_filename = r"testing.csv.7z"
with open(input_filename, 'rb') as infile:
    obj = pylzma.decompressobj()
    o = open('decompressed.raw', 'wb')
    obj = pylzma.decompressobj()
    while True:
        tmp = infile.read(1)
        if not tmp: break
        o.write(obj.decompress(tmp))
    o.close()

Output:

    o.write(obj.decompress(tmp))
ValueError: data error during decompression

Answer 1

This will let you iterate over the lines. It is partly derived from code I found in an answer to another question.

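As far as I know, py7zlib does not currently provide an API for reading an archive member as a stream of bytes or characters. The ArchiveFile class only provides a read() function, which decompresses the member and returns all of its uncompressed data at once. Given that, the best you can do is use that data as a buffer and return bytes or lines from it iteratively. The code below does this, but it will not help much if the problem is that the archive member itself is very large.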

I have modified the code below so that it works in both Python 2.7 and 3.x.

import io
import os
import py7zlib

class SevenZFileError(py7zlib.ArchiveError):
    pass

class SevenZFile(object):
    @classmethod
    def is_7zfile(cls, filepath):
        """ Determine if filepath points to a valid 7z archive. """
        is7z = False
        fp = None
        try:
            fp = open(filepath, 'rb')
            archive = py7zlib.Archive7z(fp)
            _ = len(archive.getnames())
            is7z = True
        except py7zlib.ArchiveError:
            # Not a valid 7z archive: fall through and return False.
            pass
        finally:
            if fp: fp.close()
        return is7z

    def __init__(self, filepath):
        fp = open(filepath, 'rb')
        self.filepath = filepath
        self.archive = py7zlib.Archive7z(fp)

    def __contains__(self, name):
        return name in self.archive.getnames()

    def readlines(self, name):
        """ Iterator of lines from an archive member. """
        if name not in self:
            raise SevenZFileError('archive member %r not found in %r' %
                                  (name, self.filepath))

        for line in io.StringIO(self.archive.getmember(name).read().decode()):
            yield line

Sample usage:

import csv

if SevenZFile.is_7zfile('testing.csv.7z'):
    sevenZfile = SevenZFile('testing.csv.7z')

    if 'testing.csv' not in sevenZfile:
        print('testing.csv is not a member of testing.csv.7z')
    else:
        reader = csv.reader(sevenZfile.readlines('testing.csv'))
        for row in reader:
            print(', '.join(row))
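
For comparison, here is a minimal sketch of what true streaming would look like if a decompressor could hand back its output in chunks, which py7zlib's read() does not do. The helper name iter_lines and the sample chunks are hypothetical and are not part of py7zlib; the sketch only shows how to buffer partial lines between chunks.

```python
def iter_lines(chunks):
    """Yield complete lines from an iterable of byte chunks."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        pieces = pending.split(b"\n")
        # The last piece is an unfinished line (or empty); keep it for later.
        pending = pieces.pop()
        for piece in pieces:
            yield piece + b"\n"
    if pending:
        # Final line without a trailing newline.
        yield pending


# Hypothetical chunks, as a streaming decompressor might produce them.
for line in iter_lines([b"a,b\nc", b",d\n", b"e"]):
    print(line)
```

Only the unfinished tail of the current chunk is held in memory, so memory use depends on the chunk and line sizes rather than on the size of the whole member.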

Answer 2

If you are using Python 3.3+, you may be able to do this with the lzma module, which was added to the standard library in that version.

See: lzma Examples
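
A minimal sketch of line-by-line reading with lzma follows. One important assumption: the lzma module reads .xz and .lzma streams, not the 7z container format, so this only applies if the data is available (or can be recompressed) as .xz. The file name testing.csv.xz and the helper read_rows are hypothetical; the sketch writes a tiny sample file first so that it is self-contained.

```python
import csv
import lzma
import os
import tempfile

# Create a small .xz sample file (hypothetical name) so the sketch can run.
path = os.path.join(tempfile.mkdtemp(), "testing.csv.xz")
with lzma.open(path, "wt", encoding="utf-8", newline="") as f:
    f.write("a,b\n1,2\n")


def read_rows(xz_path):
    """Yield CSV rows one at a time, decompressing as the file is read."""
    with lzma.open(xz_path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f):
            yield row


for row in read_rows(path):
    print(", ".join(row))
```

Because lzma.open returns a file-like object, csv.reader pulls lines from it on demand and the whole file is never decompressed into memory at once.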

Answer 3

If you can use Python 3, there is a useful library, py7zr, which supports partial 7zip decompression, as shown below:

import py7zr
import re
filter_pattern = re.compile(r'<your/target/file_and_directories/regex/expression>')
with py7zr.SevenZipFile('archive.7z', 'r') as archive:
    allfiles = archive.getnames()
    # Keep only the member names that match the pattern.
    selective_files = [f for f in allfiles if filter_pattern.match(f)]
    archive.extract(targets=selective_files)

How do I read from a text file compressed with 7z?
