So, in Python, I am using markovify to build Markov models of large text corpora and to generate random sentences from them. I am also using nltk to make the Markov models obey sentence structure. Since generating a Markov model from a large corpus takes quite a while, especially with nltk's part-of-speech tagger, regenerating the same model every time is wasteful, so I decided to save the Markov models as JSON files so I can reuse them later. However, when I try to read several of these large JSON files back in Python, I run into problems. Here is the code:
import nltk
import markovify
import os
import json

pathfiles = 'C:/Users/MF/Documents/NetBeansProjects/Newyckham/data/'
filenames = []
ebook = []

def build_it(path):
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".json"):
                filenames.append(os.path.join(root, file))
    for file in filenames:
        print(str(file))
        with open(file) as myjson:
            ebook.append(markovify.Text.from_json(json.load(myjson)))
    return ebook

text_model = markovify.combine(build_it(pathfiles))

for i in range(5):
    print(text_model.make_sentence())
    print('\r\n')
    print(text_model.make_short_sentence(140))
    print('\r\n')
But I get the following error:
Traceback (most recent call last):
  File "C:\Users\MF\Desktop\eclipse\markovify-master\terceiro.py", line 24, in <module>
    text_model = markovify.combine(build_it(pathfiles))
  File "C:\Users\MF\Desktop\eclipse\markovify-master\terceiro.py", line 21, in build_it
    ebook.append(markovify.Text.from_json(json.load(myjson)))
  File "C:\Python27\lib\json\__init__.py", line 290, in load
    **kw)
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
MemoryError
I have already read some similar questions on this site about how to handle this, and most of them point to using ijson and skipping the parts of the JSON file you don't need (roughly the pattern in the sketch below), but there is nothing in these JSONs that I can skip. Any ideas?
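For reference, that ijson pattern looks roughly like this; it is only a sketch, and the file name, the 'some.key.item' prefix, and the process function are hypothetical, since my model files have no such sub-structure:

import ijson

with open('big.json', 'rb') as f:
    # Decode only the entries under the given prefix, one at a time,
    # instead of loading the entire document into memory.
    for entry in ijson.items(f, 'some.key.item'):
        process(entry)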
Answer 0 (score: 0)
You can use generators to get rid of the redundant in-memory copies of the JSON data.
def get_filenames(path):
    # Lazily yield each .json path instead of accumulating them in a list.
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith(".json"):
                yield os.path.join(root, file)

def build_it(path):
    # Yield one model at a time, so only one file's JSON is decoded at once
    # rather than keeping every loaded model in a shared list.
    for file in get_filenames(path):
        print(str(file))
        with open(file) as myjson:
            yield markovify.Text.from_json(json.load(myjson))
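For completeness, a hypothetical way to consume this generator: markovify.combine expects a list of models rather than a generator, so the models are materialized first.

text_model = markovify.combine(list(build_it(pathfiles)))

If holding all of the per-file models at once is still too much, a sketch of folding them in one at a time, so that only the running combination and the current model are alive at any point:

text_model = None
for model in build_it(pathfiles):
    # Fold each model into the running combination as soon as it is built.
    text_model = model if text_model is None else markovify.combine([text_model, model])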