
Time: 2016-07-31 07:18:26

Tags: python nltk pickle


  1. What am I doing wrong?
  2. Is it a good idea to serialize the HMM when one has a dataset?
  3. Here is the code:

    In [1]: import nltk
    In [2]: from nltk.probability import *
    In [3]: from nltk.util import unique_list
    In [4]: import json
    In [5]: with open('data.json') as data_file:
       ...:         corpus = json.load(data_file)
    In [6]: corpus = [[tuple(l) for l in sentence] for sentence in corpus]
    In [7]: tag_set = unique_list(tag for sent in corpus for (word,tag) in sent)
    In [8]: symbols = unique_list(word for sent in corpus for (word,tag) in sent)
    In [9]: trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)
    In [10]: train_corpus = corpus[:4]
    In [11]: test_corpus = [corpus[4]]
    In [12]: hmm = trainer.train_supervised(train_corpus, estimator=LaplaceProbDist)
    In [13]: print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))


    In [14]: import pickle
    In [16]: output = open('hmm.pkl', 'wb')
    In [17]: pickle.dump(hmm, output)
    In [18]: output.close()


    In [19]: %reset
    Once deleted, variables cannot be recovered. Proceed (y/[n])? y
    In [20]: import pickle
    In [21]: import json
    In [22]: with open('data.json') as data_file:
       ....:     corpus = json.load(data_file)
    In [23]: test_corpus = [corpus[4]]
    In [24]: pkl_file = open('hmm.pkl', 'rb')
    In [25]: hmm = pickle.load(pkl_file)
    In [26]: pkl_file.close()
    In [27]: type(hmm)
    Out[27]: nltk.tag.hmm.HiddenMarkovModelTagger
    In [28]: print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))

1 Answer:

Answer 0 (score: 0)

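1) After In [22], you need to add the same conversion you used in In [6], since json.load returns each (word, tag) pair as a list rather than a tuple: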

    corpus = [[tuple(l) for l in sentence] for sentence in corpus]

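2) Retraining the model every time you want to test it would be time-consuming, so saving the model with pickle.dump and loading it back is a good idea.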
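To illustrate both points, here is a minimal sketch that uses only the standard library. It shows that a JSON round trip turns (word, tag) tuples into lists, which is the likely reason the second evaluate call scores differently: as far as I can tell, the gold tokens are compared with the tagger's output by equality, and a list never equals a tuple. It also shows the save/load step written with `with` blocks. The sample sentence, the `save_model` / `load_model` helpers, and the dict standing in for the trained tagger are all made up for illustration.

```python
import json
import os
import pickle
import tempfile

# Tagged sentences in the form the tagger works with: (word, tag) tuples.
# The words and tags are made-up sample data.
original = [[("the", "DT"), ("dog", "NN")]]

# JSON has no tuple type, so a round trip gives back [word, tag] lists.
reloaded = json.loads(json.dumps(original))
print(reloaded[0][0])                    # ['the', 'DT']
print(reloaded[0][0] == original[0][0])  # False: a list never equals a tuple

# The conversion from the answer restores the tuples.
restored = [[tuple(pair) for pair in sentence] for sentence in reloaded]
print(restored == original)              # True


def save_model(model, path):
    """Pickle any picklable object; the with-block closes the file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled object from path."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for the trained tagger (a plain dict, not an NLTK object).
stand_in_model = {"tags": ["DT", "NN"], "smoothing": "laplace"}
path = os.path.join(tempfile.mkdtemp(), "hmm.pkl")
save_model(stand_in_model, path)
loaded_model = load_model(path)
print(loaded_model == stand_in_model)    # True
```

With the real tagger you would pass `hmm` to the save helper instead of the dict. Keep in mind that unpickling runs code from the file, so only load pickles you created yourself.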