Training a classifier in batches

Date: 2014-11-13 10:55:11

Tags: python parallel-processing machine-learning batch-processing nltk

I am training a classifier in nltk on a large dataset (15 data files, each with 5 * 10^5 entries), and I am stuck on this error:

Traceback (most recent call last):
  File "term_classify.py", line 51, in <module>
    classifier = obj.run_classifier(cltype)
  File "/root/Desktop/karim/software/nlp/nltk/publish/lists/classifier_function.py", line 146, in run_classifier
    classifier = NaiveBayesClassifier.train(train_set)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/naivebayes.py", line 210, in train
    count = feature_freqdist[label, fname].N()
MemoryError

Code:

def run_classifier(self, cltype):
    # Map each category name to the sense label used in training.
    texts = {}
    texts['act'] = 'act'
    texts['art'] = 'art'
    texts['animal'] = 'anim'
    texts['country'] = 'country'
    texts['company'] = 'comp'
    # `features` and `sense` are defined elsewhere in the class.
    train_set = [(self.get_feature(word), sense) for word in features]
    # len(train_set) == 545668. Better if we could push 100000 at a time.
    classifier = NaiveBayesClassifier.train(train_set)

Is there a way to train the classifier in batches, or any other way to reduce the memory load without affecting the results?

1 answer:

Answer 0 (score: 0):

You can always switch to scikit-learn's naive Bayes, which supports batch learning (partial fit):

http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB


    partial_fit(X, y, classes=None, sample_weight=None)

    Incremental fit on a batch of samples. This method is expected to be
    called several times consecutively on different chunks of a dataset so
    as to implement out-of-core or online learning. This is especially
    useful when the whole dataset is too big to fit in memory at once. This
    method has some performance overhead, hence it is better to call
    partial_fit on chunks of data that are as large as possible (as long as
    they fit in the memory budget) to hide the overhead.
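
A minimal out-of-core sketch along these lines (not from the answer itself): it assumes the NLTK-style (featureset_dict, label) pairs can be streamed from an iterable, here called labeled_featuresets, and uses FeatureHasher to vectorize each chunk, since FeatureHasher (unlike DictVectorizer) needs no global fitting pass. The chunk size and label list mirror the question; non_negative=True is the pre-0.19 scikit-learn spelling (newer releases use alternate_sign=False).

    from itertools import islice

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.naive_bayes import MultinomialNB

    CHUNK_SIZE = 100000                                   # "push 100000 at a time"
    CLASSES = ['act', 'art', 'anim', 'country', 'comp']   # label set, known up front

    # Hash NLTK-style feature dicts into sparse vectors; keep values
    # non-negative because MultinomialNB rejects negative features.
    hasher = FeatureHasher(input_type='dict', non_negative=True)
    clf = MultinomialNB()

    def chunks(iterable, size):
        # Yield successive lists of at most `size` items from `iterable`.
        it = iter(iterable)
        while True:
            chunk = list(islice(it, size))
            if not chunk:
                return
            yield chunk

    # `labeled_featuresets` is assumed to be a lazy iterable of
    # (featureset_dict, label) pairs, like the question's train_set.
    for chunk in chunks(labeled_featuresets, CHUNK_SIZE):
        featuresets, labels = zip(*chunk)
        X = hasher.transform(featuresets)
        # `classes` must be supplied on the first call to partial_fit;
        # passing the same list on later calls is harmless.
        clf.partial_fit(X, list(labels), classes=CLASSES)

Each partial_fit call updates the class and feature counts in place, so peak memory is bounded by one chunk plus the model's count tables rather than the full 545668-item list.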