I am trying to do POS tagging using ClassifierBasedPOSTagger with classifier_builder=MaxentClassifier.train. Here is the code:
from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
print(me_tagger.evaluate(test_sents))
But after running the code for an hour, I see that it is still initializing ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train). In the output I can see the following:
==> Training (100 iterations)
Iteration Log Likelihood Accuracy
---------------------------------------
1 -5.35659 0.007
2 -0.85922 0.953
3 -0.56125 0.986
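I assume training will run for 100 iterations before the classifier is ready to assign part-of-speech tags to any input, and I guess that would take a whole day. Why does it take so much time? Would reducing the number of iterations make this code reasonably practical (that is, faster while still accurate enough to be useful)? If so, how do I reduce the iterations?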
EDIT
After 1.5 hours, I got the following output:
==> Training (100 iterations)
Iteration Log Likelihood Accuracy
---------------------------------------
1 -5.35659 0.007
2 -0.85922 0.953
3 -0.56125 0.986
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1310: RuntimeWarning: overflow encountered in power
exp_nf_delta = 2 ** nf_delta
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1312: RuntimeWarning: invalid value encountered in multiply
sum1 = numpy.sum(exp_nf_delta * A, axis=0)
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1313: RuntimeWarning: invalid value encountered in multiply
sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
Final nan 0.991
0.892155885577594
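Was the algorithm supposed to run for 100 iterations, as stated in the first line of the output, and did it stop early because of these errors? Is there any way to reduce the time needed for training?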
Answer 0 (score: 2)
You can set the max_iter parameter to the number of iterations you want.
Code:
from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
# Change size based on your requirement
size = int(len(brown_tagged_sents) * 0.05)
print("size:",size)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
#me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=lambda train_feats: MaxentClassifier.train(train_feats, max_iter=15))
print(me_tagger.evaluate(test_sents))
Output:
('size:', 231)
==> Training (15 iterations)
Iteration Log Likelihood Accuracy
---------------------------------------
1 -4.67283 0.013
2 -0.89282 0.964
3 -0.56137 0.998
4 -0.40573 0.999
5 -0.31761 0.999
6 -0.26107 0.999
7 -0.22175 0.999
8 -0.19284 0.999
9 -0.17067 0.999
10 -0.15315 0.999
11 -0.13894 0.999
12 -0.12719 0.999
13 -0.11730 0.999
14 -0.10887 0.999
Final -0.10159 0.999
0.787489765499
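Besides max_iter, MaxentClassifier.train also accepts an algorithm argument and a min_lldelta cutoff, so training can stop once the log likelihood stops improving by much. Here is a minimal sketch that uses a toy feature set instead of the Brown corpus so that it runs in seconds. The helper name quick_maxent, the toy data, and the cutoff values are illustrative choices of mine, not tuned recommendations:

```python
from nltk.classify import MaxentClassifier


def quick_maxent(train_feats):
    """Hypothetical builder: train a MaxEnt model with early-stopping cutoffs."""
    return MaxentClassifier.train(
        train_feats,
        algorithm="gis",    # the default is IIS; GIS is the other built-in option
        trace=0,            # silence the per-iteration table
        max_iter=10,        # hard cap on the number of iterations
        min_lldelta=0.01,   # stop when log likelihood improves by less than this
    )


# Toy (featureset, label) pairs, made up purely for illustration.
toy_feats = [
    ({"suffix": "ing"}, "VBG"),
    ({"suffix": "ed"}, "VBD"),
    ({"suffix": "ing"}, "VBG"),
    ({"suffix": "ly"}, "RB"),
]

clf = quick_maxent(toy_feats)
print(clf.classify({"suffix": "ing"}))
```

To use it for tagging, pass classifier_builder=quick_maxent to ClassifierBasedPOSTagger in place of the lambda above. Whether GIS or a looser cutoff is actually faster on your data is something to measure rather than assume.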
EDIT:
These messages are RuntimeWarnings, not errors.
On the 4th iteration it found Log Likelihood = nan, so it stopped further processing, and that iteration became the final one.
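One plausible reading of how those warnings lead to a nan log likelihood: raising 2 to a very large exponent overflows to inf, and multiplying inf by 0 is nan, which then spreads through the sum. Here is a small NumPy sketch of that arithmetic. The values of nf_delta and A are made-up stand-ins chosen to trigger the overflow, not the arrays NLTK actually builds:

```python
import numpy as np

# Hypothetical stand-ins for the arrays named in the warnings.
nf_delta = np.array([10.0, 2000.0])
A = np.array([1.0, 0.0])

# Suppress the RuntimeWarnings here so the script output stays clean.
with np.errstate(over="ignore", invalid="ignore"):
    exp_nf_delta = 2 ** nf_delta   # 2**2000 does not fit in float64 -> inf
    product = exp_nf_delta * A     # inf * 0 is undefined -> nan
    total = np.sum(product)        # a single nan makes the whole sum nan

print(exp_nf_delta)
print(product)
print(total)
```

Once a nan like this reaches the log likelihood, the training loop treats it as a stopping condition, which matches the "Final nan" row in the question's output.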