How do I change the number of iterations in the maxent classifier used for POS tagging in NLTK?

Date: 2016-09-08 12:42:46

Tags: python nlp classification nltk

I am trying to do POS tagging with ClassifierBasedPOSTagger using classifier_builder=MaxentClassifier.train. Here is the code:

from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)

train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
print(me_tagger.evaluate(test_sents))

But after the code had been running for an hour, I found it was still initializing ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train). In the output I could see the following:

  ==> Training (100 iterations)

  Iteration    Log Likelihood    Accuracy
  ---------------------------------------
         1          -5.35659        0.007
         2          -0.85922        0.953
         3          -0.56125        0.986

I assume the classifier will run 100 iterations before it is ready to tag parts of speech for any input, and I guess that will take a whole day. Why does it take so much time? Would reducing the number of iterations make this code somewhat practical (i.e., faster while still useful enough), and if so, how do I reduce those iterations?

Edit

After 1.5 hours, I got the following output:

  ==> Training (100 iterations)

  Iteration    Log Likelihood    Accuracy
  ---------------------------------------
         1          -5.35659        0.007
         2          -0.85922        0.953
         3          -0.56125        0.986
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1310: RuntimeWarning: overflow encountered in power
  exp_nf_delta = 2 ** nf_delta
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1312: RuntimeWarning: invalid value encountered in multiply
  sum1 = numpy.sum(exp_nf_delta * A, axis=0)
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1313: RuntimeWarning: invalid value encountered in multiply
  sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
         Final               nan        0.991
0.892155885577594

Should the algorithm have gone through the 100 iterations announced in the first line of the output, and did it stop early because of those errors? Is there any way to reduce the time needed for training?

1 Answer:

Answer 0 (score: 2)

You can set the max_iter parameter to the number of iterations you want.

Code:

from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
# Change size based on your requirement
size = int(len(brown_tagged_sents) * 0.05)
print("size:",size)

train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

#me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=lambda train_feats: MaxentClassifier.train(train_feats, max_iter=15))
print(me_tagger.evaluate(test_sents))

Output:

('size:', 231)
  ==> Training (15 iterations)

  Iteration    Log Likelihood    Accuracy
  ---------------------------------------
         1          -4.67283        0.013
         2          -0.89282        0.964
         3          -0.56137        0.998
         4          -0.40573        0.999
         5          -0.31761        0.999
         6          -0.26107        0.999
         7          -0.22175        0.999
         8          -0.19284        0.999
         9          -0.17067        0.999
        10          -0.15315        0.999
        11          -0.13894        0.999
        12          -0.12719        0.999
        13          -0.11730        0.999
        14          -0.10887        0.999
     Final          -0.10159        0.999
0.787489765499
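
Once training finishes, the tagger can also be applied directly to a tokenized sentence. A minimal usage sketch (the example sentence is arbitrary; the exact tags returned depend on the trained model):

# Tag a list of tokens with the trained tagger; returns a list of (token, tag) pairs.
tokens = ['The', 'jury', 'praised', 'the', 'administration']
print(me_tagger.tag(tokens))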

Edit

Those messages are RuntimeWarnings, not errors.

On the 4th iteration it hit Log Likelihood = nan, so it stopped processing further; that iteration therefore became the final one.
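
If the overflow warnings themselves are a concern, one possible workaround (my assumption, not part of the original answer) is to switch the training algorithm from the default IIS, whose 2 ** nf_delta update is where the overflow occurred, to GIS, while still capping the iterations:

from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier

# Assumption: GIS avoids the IIS delta computation that overflowed above.
gis_builder = lambda train_feats: MaxentClassifier.train(
    train_feats, algorithm='GIS', max_iter=15, trace=0)
me_tagger_gis = ClassifierBasedPOSTagger(train=train_sents,
                                         classifier_builder=gis_builder)
print(me_tagger_gis.evaluate(test_sents))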