Question

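I'll be running code that writes a large number (~1000) of relatively small dictionaries (about 50 key: value pairs of strings) to a log file. A program will do this automatically. I'm thinking of running something like this: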

import random
import string
import cPickle as pickle
import zlib

fieldNames = ['AICc','Npix','Nparameters','DoF','chi-square','chi-square_nu']

tempDict = {}
overview = {}
iterList = []

# Create example dictionary to add to the log.
for item in fieldNames:
  tempDict[item] = random.choice([random.uniform(2,5), '', ''.join([random.choice(string.lowercase) for x in range(5)])])

# Compress and pickle and add the example dictionary to the log.
# tried  with 'ab' and 'wb' 
# is .p.gz the right extension for this kind of file??
# with open('google.p.gz', 'wb') as fp: 
with open('google.p.gz', 'ab') as fp:
  fp.write(zlib.compress(pickle.dumps(tempDict, pickle.HIGHEST_PROTOCOL),9))

# Attempt to read in entire log
i = 0
with open('google.p.gz', 'rb') as fp:
  # Call pickle.loads until all dictionaries loaded. 
  while 1:
    try:     
      i += 1
      iterList.append(i)
      overview[i] = {}
      overview[i] = pickle.loads(zlib.decompress(fp.read()))
    except:
      break

print tempDict
print overview

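I'd like to be able to load the last dictionary written to the log file (google.p.gz), but at the moment it only loads the first pickle.dump.

Also, is there a better way to do everything I'm doing? I've searched around and it feels like I'm the only one doing something like this, and I've found that to be a bad sign in the past.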

Answer 1

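Your input and output don't match. When you write the records, you take each record individually, pickle it, compress it, and write the result to the file on its own: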

fp.write(zlib.compress(pickle.dumps(tempDict, pickle.HIGHEST_PROTOCOL),9))

But when you read the records back, you read the whole file, decompress it, and unpickle a single object from it:

pickle.loads(zlib.decompress(fp.read()))

So the next time you call fp.read(), there is nothing left: you already read the whole file the first time.
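
To make the mismatch concrete, here is a minimal sketch in Python 3 (where cPickle has been folded into pickle). It builds an in-memory stand-in for the log, shows that a second read() returns nothing, and then recovers every record by walking the concatenated zlib streams with zlib.decompressobj and its unused_data attribute. The helper name split_streams and the sample records are made up for illustration; this is one possible way to read the per-record format, not necessarily the best one.

```python
import io
import pickle
import zlib

records = [{'n': 1}, {'n': 2}, {'n': 3}]

# Stand-in for the log file: each record is pickled and compressed on its own,
# and the compressed chunks are simply written one after another.
blob = b''.join(
    zlib.compress(pickle.dumps(r, pickle.HIGHEST_PROTOCOL), 9) for r in records)

# Reading the way the question does: the first read() consumes everything.
fp = io.BytesIO(blob)
first_read = fp.read()
second_read = fp.read()  # nothing left to read


def split_streams(data):
    """Yield one unpickled record per zlib stream found in `data`."""
    while data:
        d = zlib.decompressobj()
        payload = d.decompress(data) + d.flush()
        yield pickle.loads(payload)
        # Bytes that follow the end of this stream belong to the next record.
        data = d.unused_data


recovered = list(split_streams(first_read))
last = recovered[-1]
```

Here recovered holds all three records and last is the most recently appended one, so the data was never lost; the original loop just had no way to reach it.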

So you have to make your input match your output. How to do that depends on your exact requirements. Let's suppose your requirements are:

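There will be so many records that the file needs to be compressed on disk.
All the records will be written to the file in one go (you don't need to append individual records).
You don't need random access to the records in the file (you will always be happy to read the whole file to get to the last record).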

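Given these requirements, compressing each record individually with zlib is a bad idea. The DEFLATE algorithm used by zlib works by finding repeated sequences, so it works best on large amounts of data. It won't do much for a single record. So let's use the gzip module to compress and decompress the whole file instead.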
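
A rough way to check this claim is to compress the same records both ways and compare sizes. The sketch below is Python 3 and uses invented data: the field names (field_00 and so on), the record count, and the value generator are all assumptions chosen to resemble the records described in the question. It uses zlib for both cases so that only the granularity of compression differs.

```python
import pickle
import random
import string
import zlib

random.seed(0)  # fixed seed so repeated runs compare the same data

# Hypothetical field names, standing in for the ~50 keys in the question.
field_names = ['field_%02d' % i for i in range(50)]


def random_record():
    """Return a dict of short random lowercase strings, one per field."""
    return {name: ''.join(random.choice(string.ascii_lowercase)
                          for _ in range(5))
            for name in field_names}


pickled = [pickle.dumps(random_record(), pickle.HIGHEST_PROTOCOL)
           for _ in range(1000)]

# Option 1: compress every record on its own, as the question does.
per_record_size = sum(len(zlib.compress(p, 9)) for p in pickled)

# Option 2: compress the whole sequence of pickles as a single stream.
whole_stream_size = len(zlib.compress(b''.join(pickled), 9))

print('per record:  ', per_record_size)
print('whole stream:', whole_stream_size)
```

The key names repeat in every record, so the single stream can encode them as back-references, while each per-record stream has to spell them out again.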

I've also made a few other improvements to your code as I went through it.

import cPickle as pickle
import gzip
import random
import string

field_names = 'AICc Npix Nparameters DoF chi-square chi-square_nu'.split()

random_value_constructors = [
    lambda: random.uniform(2,5),
    lambda: ''.join(random.choice(string.lowercase)
                    for x in xrange(random.randint(0, 5)))]

def random_value():
    """
    Return a random value, either a small floating-point number or a
    short string.
    """
    return random.choice(random_value_constructors)()

def random_record():
    """
    Create and return a random example record.
    """
    return {name: random_value() for name in field_names}

def write_records(filename, records):
    """
    Pickle each record in `records` and compress them to `filename`.
    """
    with gzip.open(filename, 'wb') as f:
        for r in records:
            pickle.dump(r, f, pickle.HIGHEST_PROTOCOL)

def read_records(filename):
    """
    Decompress `filename`, unpickle records from it, and yield them.
    """
    with gzip.open(filename, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
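
As a usage sketch, the functions above can be combined to fetch the last record. The version below is a Python 3 port (pickle instead of cPickle), repeated here so the block runs on its own. The last_record helper, the file name log.pickle.gz, and the {'index': i} records are hypothetical additions for illustration only.

```python
import collections
import gzip
import os
import pickle
import tempfile


def write_records(filename, records):
    """Pickle each record in `records` and compress them to `filename`."""
    with gzip.open(filename, 'wb') as f:
        for r in records:
            pickle.dump(r, f, pickle.HIGHEST_PROTOCOL)


def read_records(filename):
    """Decompress `filename`, unpickle records from it, and yield them."""
    with gzip.open(filename, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def last_record(filename):
    """Return the final record in `filename`, or None if the file is empty."""
    # A deque with maxlen=1 keeps only the most recent item it has seen,
    # so the whole log is never held in memory at once.
    tail = collections.deque(read_records(filename), maxlen=1)
    return tail[0] if tail else None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'log.pickle.gz')
    write_records(path, ({'index': i} for i in range(1000)))
    last = last_record(path)
    count = sum(1 for _ in read_records(path))

print(count, last)
```

Reading the last record still means decompressing the whole file, which matches the third requirement above. If you later need random access, a different on-disk layout would be a better fit.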
