I'm trying to serialize an nltk.tag.hmm.HiddenMarkovModelTagger to a pickle so I can use it whenever I need it without retraining. However, after loading it from the .pkl, my HMM appears untrained. My two questions are: 1) what am I doing wrong that makes the unpickled model look untrained, and 2) is pickling the right approach here, or should I just retrain the model each time?
Here is the code:
In [1]: import nltk
In [2]: from nltk.probability import *
In [3]: from nltk.util import unique_list
In [4]: import json
In [5]: with open('data.json') as data_file:
...: corpus = json.load(data_file)
...:
In [6]: corpus = [[tuple(l) for l in sentence] for sentence in corpus]
In [7]: tag_set = unique_list(tag for sent in corpus for (word,tag) in sent)
In [8]: symbols = unique_list(word for sent in corpus for (word,tag) in sent)
In [9]: trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)
In [10]: train_corpus = corpus[:4]
In [11]: test_corpus = [corpus[4]]
In [12]: hmm = trainer.train_supervised(train_corpus, estimator=LaplaceProbDist)
In [13]: print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))
100.00%
As you can see, the HMM is trained. Now I pickle it:
In [14]: import pickle
In [16]: output = open('hmm.pkl', 'wb')
In [17]: pickle.dump(hmm, output)
In [18]: output.close()
After resetting and loading it back, the model looks dumber than a box of rocks:
In [19]: %reset
Once deleted, variables cannot be recovered. Proceed (y/[n])? y
In [20]: import pickle
In [21]: import json
In [22]: with open('data.json') as data_file:
....: corpus = json.load(data_file)
....:
In [23]: test_corpus = [corpus[4]]
In [24]: pkl_file = open('hmm.pkl', 'rb')
In [25]: hmm = pickle.load(pkl_file)
In [26]: pkl_file.close()
In [27]: type(hmm)
Out[27]: nltk.tag.hmm.HiddenMarkovModelTagger
In [28]: print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))
0.00%
Answer 0 (score: 0):
1) After In [22], you need to add:

corpus = [[tuple(l) for l in sentence] for sentence in corpus]

json.load gives you nested lists rather than tuples, and a ['word', 'tag'] list never compares equal to the ('word', 'tag') tuples the tagger produces, so the evaluation comes out as 0.00% (see the sketch after point 2).
2) Retraining the model every time just to test it would be time-consuming, so dumping your model with pickle and loading it back is a good approach.
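
For reference, here is a minimal sketch of the reload step with that conversion added, assuming the same hmm.pkl and data.json files as in the question:

import json
import pickle

# Load the pickled tagger back in.
with open('hmm.pkl', 'rb') as pkl_file:
    hmm = pickle.load(pkl_file)

# Reload the corpus and convert each token back to a (word, tag) tuple,
# since json.load only produces nested lists.
with open('data.json') as data_file:
    corpus = json.load(data_file)
corpus = [[tuple(l) for l in sentence] for sentence in corpus]

test_corpus = [corpus[4]]
print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))  # should print 100.00% again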