I am training classifiers in NLTK on a large dataset (15 data files with roughly 5 * 10^5 entries each), and I keep running into this error:
Traceback (most recent call last):
  File "term_classify.py", line 51, in <module>
    classifier = obj.run_classifier(cltype)
  File "/root/Desktop/karim/software/nlp/nltk/publish/lists/classifier_function.py", line 146, in run_classifier
    classifier = NaiveBayesClassifier.train(train_set)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/naivebayes.py", line 210, in train
    count = feature_freqdist[label, fname].N()
MemoryError
Code:
def run_classifier(self, cltype):
    # create our dict of training data
    texts = {}
    texts['act'] = 'act'
    texts['art'] = 'art'
    texts['animal'] = 'anim'
    texts['country'] = 'country'
    texts['company'] = 'comp'

    train_set = []
    train_set = train_set + [(self.get_feature(word), sense) for word in features]
    # len of train_set = 545668. Better if we could push 100000 at a time
    classifier = NaiveBayesClassifier.train(train_set)
Is there any way to train the classifier in batches, or any other approach that reduces the memory load without affecting the results?
Answer 0 (score: 0)
You can always switch to scikit-learn's naive Bayes implementations, which support batch learning via partial_fit:
partial_fit(X, y, classes=None, sample_weight=None)

Incremental fit on a batch of samples. This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning. This is especially useful when the whole dataset is too big to fit in memory at once. This method has some performance overhead, hence it is better to call partial_fit on chunks of data that are as large as possible (as long as they fit the memory budget) to hide the overhead.
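A minimal sketch of how that could be wired up for the asker's NLTK-style (feature_dict, label) pairs, assuming a scikit-learn recent enough to offer FeatureHasher's alternate_sign option; the FeatureHasher choice, the batch size, and the train_in_batches helper are illustrative, not part of the original code:

from sklearn.feature_extraction import FeatureHasher
from sklearn.naive_bayes import MultinomialNB

def train_in_batches(train_set, all_labels, batch_size=100000):
    # FeatureHasher is stateless, so each chunk can be vectorized on its own;
    # alternate_sign=False keeps the hashed counts non-negative, which
    # MultinomialNB requires.
    hasher = FeatureHasher(input_type='dict', alternate_sign=False)
    clf = MultinomialNB()
    for start in range(0, len(train_set), batch_size):
        chunk = train_set[start:start + batch_size]
        X = hasher.transform(feats for feats, label in chunk)
        y = [label for feats, label in chunk]
        # classes must be supplied on the first call to partial_fit
        clf.partial_fit(X, y, classes=all_labels)
    return hasher, clf

As the quoted docstring suggests, batch_size should be set as large as the memory budget allows; since the hasher never needs a pass over the full dataset, peak memory is bounded by one chunk plus the model's count arrays.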