Question

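I'm trying to parse a bunch of gzipped JSON files (.gz; the files are text files) with the same function, using multiprocessing in Python 2.7.10. However, almost at the end of parsing the lines in these files, it raises the following error: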

error: Error -3 while decompressing: invalid code lengths set

and stops execution.

This is my code:

import gzip
import json
from multiprocessing import Pool, cpu_count

def build_list(file_name):

    count = 0

    try:
        json_file = gzip.open(file_name, "r")
    except Exception as e:
        print e
    else:

        # Data parsing
        for line in json_file:
            try:
                row = json.loads(line)
            except Exception as e:
                print e
            else:                
                count += 1

if __name__ == "__main__":

    files = ["h1.json.gz", "h2.json.gz", "h3.json.gz", "h4.json.gz", "h5.json.gz"]

    pool = Pool(processes=cpu_count()-1)
    pool.map(build_list, files)

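It is important to clarify that the program starts out running fine, and when I check with top, the files are distributed across the processors. I also checked the integrity of the files with gunzip -t, and they seem fine. In addition, I don't see any exception raised before the error. Do you have any idea how I can solve it? Thanks in advance.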

Answer 1

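Read in binary mode:

json_file = gzip.open(file_name, "rb")

Reading in text mode can corrupt the data on some platforms (because it isn't text), and that leads to weird errors such as this one.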
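For illustration, here is a minimal sketch of the question's read loop with the file opened in "rb" mode. The function name count_json_lines and the UTF-8 decoding are assumptions made for this example; they are not part of the original code.

```python
import gzip
import json


def count_json_lines(file_name):
    """Count the lines of a gzipped JSON-lines file that parse as JSON."""
    count = 0
    # "rb" requests binary mode explicitly; each line comes back as bytes.
    with gzip.open(file_name, "rb") as json_file:
        for line in json_file:
            try:
                # Assumption: the JSON text is UTF-8 encoded.
                json.loads(line.decode("utf-8"))
            except ValueError:
                # Not valid JSON (or not valid UTF-8): skip this line.
                continue
            count += 1
    return count
```

Returning the count, instead of only incrementing a local variable, lets pool.map collect one result per file.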

Answer 2

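I ended up wrapping the read loop in a try block, so that an error raised while reading a line from the file object is caught. The final code is as follows: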

def build_list(file_name):

    count = 0

    try:
        json_file = gzip.open(file_name, "r")
    except Exception as e:
        print e
    else:

        try:
            # Data parsing
            for line in json_file:
                try:
                    row = json.loads(line)
                except Exception as e:
                    print e
                else:                
                    count += 1
        except Exception as e:
            print e

Thanks for all your input.
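As a rough sketch of the same idea, the variant below catches only the errors that opening or decompressing a gzip file typically raises, and returns what happened instead of printing it. The (file_name, count, error) return shape and the exact exception list are assumptions for this example. Note that this reports a decompression failure; it does not fix whatever causes it.

```python
import gzip
import json
import zlib


def build_list(file_name):
    """Return (file_name, count, error) for one gzipped JSON-lines file."""
    count = 0
    error = None
    try:
        with gzip.open(file_name, "rb") as json_file:
            for line in json_file:
                try:
                    # Assumption: the JSON text is UTF-8 encoded.
                    json.loads(line.decode("utf-8"))
                except ValueError:
                    continue  # skip lines that are not valid JSON
                count += 1
    except (IOError, EOFError, zlib.error) as e:
        # Opening or decompressing failed; keep the count reached so far.
        error = repr(e)
    return file_name, count, error
```

With pool.map(build_list, files), the parent process then receives one tuple per file and can see which files failed and how many lines were parsed before the failure.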

Error reading a bunch of .gz files: Error -3 while decompressing: invalid code lengths set