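I'm trying to save a dictionary to a file using pickle. The code that saves the dictionary runs without any problem, but when I try to retrieve the dictionary from the file in the Python shell, I get an EOFError: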
>>> import pprint
>>> import pickle
>>> pkl_file = open('data.pkl', 'rb')
>>> data1 = pickle.load(pkl_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 880, in load_eof
    raise EOFError
EOFError
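My code is below.
It counts the frequency of each word and records the date of the data (the date is the file name). It then saves each word as a dictionary key, with a list of (date, freq) tuples as the value of that key. Now I want to use this dictionary as the input to another part of my work: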
def pathFilesList():
    source='StemmedDataset'
    retList = []
    for r,d,f in os.walk(source):
        for files in f:
            retList.append(os.path.join(r, files))
    return retList

def parsing():
    fileList = pathFilesList()
    for f in fileList:
        print "Processing file: " + str(f)
        fileWordList = []
        fileWordSet = set()
        fw=codecs.open(f,'r', encoding='utf-8')
        fLines = fw.readlines()
        for line in fLines:
            sWord = line.strip()
            fileWordList.append(sWord)
            if sWord not in fileWordSet:
                fileWordSet.add(sWord)
        for stemWord in fileWordSet:
            stemFreq = fileWordList.count(stemWord)
            if stemWord not in wordDict:
                wordDict[stemWord] = [(f[15:-4], stemFreq)]
            else:
                wordDict[stemWord].append((f[15:-4], stemFreq))
        fw.close()

if __name__ == "__main__":
    parsing()
    output = open('data.pkl', 'wb')
    pickle.dump(wordDict, output)
    output.close()
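What do you think the problem is?
Answer 0: (score: 1)
Since this is Python 2, you often have to be more explicit about the encoding your source code is written in. PEP 263 explains this in detail. My suggestion is to try adding the following to the top of unpickle.py: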
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The rest of your code....
By the way, if you are going to work with non-ASCII characters a lot, it is probably a good idea to use Python 3 instead.
Answer 1: (score: 0)
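To illustrate the Python 3 suggestion, here is a minimal sketch (written for Python 3, not taken from the question) that pickles a dictionary with non-ASCII keys and reads it back. The file name `demo_unicode.pkl` and the sample words are made up for illustration. The `with` blocks make sure the file is flushed and closed before it is opened again for reading.

```python
import os
import pickle
import tempfile

# Hypothetical sample data: word -> list of (date, frequency) tuples.
word_dict = {u"caf\u00e9": [("2013-01-01", 3)], u"na\u00efve": [("2013-01-02", 1)]}

# Hypothetical file name, placed in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_unicode.pkl")

# Write the dictionary; leaving the block closes (and flushes) the file.
with open(path, "wb") as output:
    pickle.dump(word_dict, output)

# Read it back from the closed, complete file.
with open(path, "rb") as pkl_file:
    loaded = pickle.load(pkl_file)

print(loaded == word_dict)
```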
# Added some code and comments to make the code more complete.
# Uses collections.Counter to count words.
import os
import codecs
import pickle
from collections import Counter

wordDict = {}

def pathFilesList():
    source = 'StemmedDataset'
    retList = []
    for r, d, f in os.walk(source):
        for files in f:
            retList.append(os.path.join(r, files))
    return retList

# Parse the corpus: count the frequency of each word per file and record
# the date of the data (the date is the file name). Each word becomes a
# dictionary key whose value is a list of (date, freq) tuples.
def parsing():
    fileList = pathFilesList()
    for f in fileList:
        # Drop the first 15 characters (the directory prefix) and the
        # last 4 (the file extension) to get the date.
        date_stamp = f[15:-4]
        print "Processing file: " + str(f)
        # One word per line, strip space. No empty lines.
        fw = codecs.open(f, mode='r', encoding='utf-8')
        fileWords = Counter(fw.read().split())
        # For each unique word, store its occurrence count in the dict.
        for stemWord, stemFreq in fileWords.items():
            if stemWord not in wordDict:
                wordDict[stemWord] = [(date_stamp, stemFreq)]
            else:
                wordDict[stemWord].append((date_stamp, stemFreq))
        # Close file and do next.
        fw.close()

if __name__ == "__main__":
    # Parse all files and store in wordDict.
    parsing()
    output = open('data.pkl', 'wb')
    # Assume wordDict is global.
    print "Dumping wordDict of size {0}".format(len(wordDict))
    pickle.dump(wordDict, output)
    output.close()
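When reading the file back, an EOFError from pickle.load generally means the unpickler ran out of data, for example because data.pkl is empty or was opened before the writing script had finished and closed it. The sketch below (written for Python 3) is one hedged way to check for that. `load_pickle_or_none` is a hypothetical helper name, not part of the pickle module, and the file names are placeholders in a temporary directory.

```python
import os
import pickle
import tempfile


def load_pickle_or_none(path):
    """Return the unpickled object, or None if the file is missing or empty."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    with open(path, "rb") as pkl_file:
        return pickle.load(pkl_file)


tmp_dir = tempfile.mkdtemp()

# An empty file: calling pickle.load on it directly raises EOFError.
empty_path = os.path.join(tmp_dir, "empty.pkl")
open(empty_path, "wb").close()
try:
    with open(empty_path, "rb") as pkl_file:
        pickle.load(pkl_file)
    raised = False
except EOFError:
    raised = True

# A file that was fully written and closed loads without trouble.
good_path = os.path.join(tmp_dir, "data.pkl")
with open(good_path, "wb") as output:
    pickle.dump({"word": [("2013-01-01", 2)]}, output)

print(raised)
print(load_pickle_or_none(empty_path))
print(load_pickle_or_none(good_path))
```

If the size check reports an empty file, the place to look is the script that writes it: whether the dump actually ran, and whether the output file was closed before the shell opened it.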
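Answer 2: (score: 0)
If you are looking for something that can save large dictionaries of data to disk or to a database, and that can make use of pickling and encoding (codecs and hashmaps), then you might want to look at klepto.
klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For large data, I often choose to represent the dictionary as a directory on my filesystem, with each entry stored as a file. klepto also offers caching algorithms, so if you use a filesystem backend for the dictionary you can avoid some of the speed penalty by using memory caching.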
>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True)
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # is set to cache to memory, so use 'dump' to dump to the filesystem
>>> demo.dump()
>>> del demo
>>>
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>
>>>
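The "one file per entry" idea can be sketched with only the standard library. This is an illustration of the concept, not klepto's implementation or API: `DirStore` is a hypothetical class, it does no caching, and it hex-encodes the pickled key to build a file name, so very long keys could exceed filesystem name limits.

```python
import binascii
import os
import pickle
import tempfile


class DirStore(object):
    """Toy mapping that stores each entry as one pickle file in a directory."""

    def __init__(self, dirname):
        self.dirname = dirname
        if not os.path.isdir(dirname):
            os.makedirs(dirname)

    def _path(self, key):
        # Hex-encode the pickled key so any picklable key gives a safe file name.
        name = binascii.hexlify(pickle.dumps(key, 2)).decode("ascii")
        return os.path.join(self.dirname, name + ".pkl")

    def __setitem__(self, key, value):
        # Every write goes straight to disk (no in-memory cache).
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f, 2)

    def __getitem__(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return pickle.load(f)
        except IOError:
            raise KeyError(key)

    def __contains__(self, key):
        return os.path.exists(self._path(key))


store = DirStore(os.path.join(tempfile.mkdtemp(), "demo"))
store["a"] = 1
store["words"] = [("2013-01-01", 3)]

# A second object pointing at the same directory sees the saved entries.
reopened = DirStore(store.dirname)
print(reopened["a"], reopened["words"], "missing" in reopened)
```

Because each entry lives in its own file, a single value can be read without loading the whole dictionary, at the cost of one file access per lookup.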
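klepto also has other flags, such as compression and memmode, that can be used to customize how your data is stored (e.g. compression level, memory map mode, etc.).
It is equally easy (the same interface) to use a database (MySQL, etc.) as the backend instead of your filesystem. You can also turn off memory caching, so that every read/write goes directly to the archive, simply by setting cached=False.
With klepto you can also customize your encoding by building a custom keymap.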
>>> from klepto.keymaps import *
>>>
>>> s = stringmap(encoding='hex_codec')
>>> x = [1,2,'3',min]
>>> s(x)
'285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c29'
>>> p = picklemap(serializer='dill')
>>> p(x)
'\x80\x02]q\x00(K\x01K\x02U\x013q\x01c__builtin__\nmin\nq\x02e\x85q\x03.'
>>> sp = s+p
>>> sp(x)
'\x80\x02UT28285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c292c29q\x00.'
Get klepto here: https://github.com/uqfoundation