I am trying to merge corpus data stored in pickle format with a document containing article paragraphs.
My instance has plenty of memory,
but I get this error:
Traceback (most recent call last):
File "summarize.py", line 112, in <module>
data = [' '.join(document) for document in data]
File "summarize.py", line 112, in <listcomp>
data = [' '.join(document) for document in data]
MemoryError
Code
if __name__ == '__main__':
    # Load corpus data used to train the TF-IDF Transformer
    data = pickle.load(open('data.pkl', 'rb'))

    # Load the document you wish to summarize
    title = ''
    document = ''

    cleaned_document = clean_document(document)
    doc = remove_stop_words(cleaned_document)

    # Merge corpus data and new document data
    data = [' '.join(document) for document in data]
    train_data = set(data + [doc])
Any ideas on what is causing this or how I can get around it?
The error occurs at cleaned_document = ...
Answer 0 (score: 1)
By the time you reach data = [' '.join(document) for document in data],
I don't think pickle is involved anymore.
File "summarize.py", line 112, in <listcomp>
data = [' '.join(document) for document in data]
MemoryError
These lines of code seem useless: document
is empty, so doc
is empty as well.
If the MemoryError
persists, remove these lines to test:
#cleaned_document = clean_document(document)
#doc = remove_stop_words(cleaned_document)
Try not reusing the variable data,
for example:
joined_data = [' '.join(document) for document in data]
train_data = set(joined_data + [doc])
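Another way to avoid the duplicate list entirely is a generator expression, which feeds the joined strings straight into the set without ever materializing a second full list. A minimal sketch with toy stand-ins for the pickled corpus (data) and the cleaned document (doc):

```python
# Toy stand-ins for the pickled corpus and the new document;
# in the real script these come from data.pkl and the cleaning functions.
data = [['the', 'cat'], ['a', 'dog']]
doc = 'new article text'

# The generator expression joins lazily, so only one joined string
# exists at a time before it is absorbed into the set.
train_data = set(' '.join(document) for document in data)
train_data.add(doc)
print(sorted(train_data))  # ['a dog', 'new article text', 'the cat']
```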
Answer 1 (score: 0)
MemoryError
refers to random-access memory: you are loading too much data into memory at once. The problem with pickle is that it has to load the entire file into memory, so if you pickled several large objects you can exceed the available memory. There may be a better way to structure the program to avoid this: save the data in a format that can be read incrementally, or split the pickled data across multiple pickle files.
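The split-into-multiple-pickles idea could be sketched like this (save_in_chunks and load_in_chunks are hypothetical helper names, not from the original code; the chunk size and toy corpus are illustrative):

```python
import os
import pickle
import tempfile

def save_in_chunks(data, dirname, chunk_size=2):
    """Split a large list across several smaller pickle files."""
    paths = []
    for i in range(0, len(data), chunk_size):
        path = os.path.join(dirname, 'data_%d.pkl' % (i // chunk_size))
        with open(path, 'wb') as f:
            pickle.dump(data[i:i + chunk_size], f)
        paths.append(path)
    return paths

def load_in_chunks(paths):
    """Yield documents one chunk at a time instead of loading everything."""
    for path in paths:
        with open(path, 'rb') as f:
            for document in pickle.load(f):
                yield document

# Demo with a toy corpus
docs = [['hello', 'world'], ['foo', 'bar'], ['baz']]
with tempfile.TemporaryDirectory() as d:
    paths = save_in_chunks(docs, d)
    joined = [' '.join(doc) for doc in load_in_chunks(paths)]
print(joined)  # ['hello world', 'foo bar', 'baz']
```

Only one chunk is unpickled at a time, so peak memory is bounded by the chunk size rather than the whole corpus.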