Merging a pickle file with another document - MemoryError

时间:2017-05-08 15:51:13

标签: python merge pickle

I am trying to merge corpus data stored in pickle format with a document containing the paragraphs of an article.

My instance has plenty of memory, but I am getting this error:

Traceback (most recent call last):
  File "summarize.py", line 112, in <module>
    data = [' '.join(document) for document in data]
  File "summarize.py", line 112, in <listcomp>
    data = [' '.join(document) for document in data]
MemoryError

Code

import pickle

if __name__ == '__main__':
    # Load corpus data used to train the TF-IDF Transformer
    data = pickle.load(open('data.pkl', 'rb'))

    # Load the document you wish to summarize
    title = ''
    document = ''

    # clean_document and remove_stop_words are defined earlier in summarize.py
    cleaned_document = clean_document(document)
    doc = remove_stop_words(cleaned_document)

    # Merge corpus data and new document data
    data = [' '.join(document) for document in data]
    train_data = set(data + [doc])

Any ideas about what is causing this, or how I can get around it?

The error occurs at cleaned_document = ...

2 answers:

Answer 0 (score: 1)

By the time you reach data = [' '.join(document) for document in data], I don't think pickle is in play anymore.

File "summarize.py", line 112, in <listcomp>
data = [' '.join(document) for document in data]
MemoryError
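
To see why, you can estimate up front how much extra memory the joined copy will need. A rough sketch, assuming data is a list of token lists as in the question:

# Rough estimate of the extra memory the comprehension will allocate:
# joining with spaces produces strings totalling about this many
# characters (~1 byte each for ASCII text in CPython 3.3+).
total_chars = sum(len(word) + 1 for document in data for word in document)
print('the joined copy needs roughly', total_chars // 2**20, 'MiB more')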

These lines of code look useless here: document is empty, so doc is empty too. Comment them out and test whether the MemoryError persists.

#cleaned_document = clean_document(document)
#doc = remove_stop_words(cleaned_document)

Also, try to reuse the variable data rather than keeping a second copy of the corpus alive, for example:

data = [' '.join(document) for document in data]
train_data = set(data)
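
If reassigning data still runs out of memory, joining in place never holds two full lists at once. A minimal sketch, again assuming each element of data is a list of token strings:

# Overwrite each entry with its joined form, so the token lists can be
# freed one by one instead of all surviving until a second list exists.
for i, document in enumerate(data):
    data[i] = ' '.join(document)

train_data = set(data)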

Answer 1 (score: 0)

A MemoryError refers to random-access memory (RAM): you are loading more data into memory than you have available. The problem with pickle is that it has to load the entire file into memory at once, so if you are pickling a few large objects you can easily exceed the available memory. There is probably a better way to structure the program to avoid this: save the data in a format that can be read incrementally, or split the pickled data across several smaller pickle files.
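
As an illustration of that last suggestion, here is a minimal sketch of splitting the corpus across several pickle files and streaming it back one chunk at a time; the file names and chunk size are made up for the example:

import pickle

CHUNK_SIZE = 10000  # hypothetical; tune to your memory budget

def dump_in_chunks(data, prefix='data_part'):
    # Write the corpus as several smaller pickle files instead of one big one.
    parts = 0
    for start in range(0, len(data), CHUNK_SIZE):
        with open('%s%d.pkl' % (prefix, parts), 'wb') as f:
            pickle.dump(data[start:start + CHUNK_SIZE], f)
        parts += 1
    return parts

def iter_documents(num_parts, prefix='data_part'):
    # Yield documents one at a time; only one chunk is in memory at once.
    for n in range(num_parts):
        with open('%s%d.pkl' % (prefix, n), 'rb') as f:
            yield from pickle.load(f)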