Error with training-set shape and fitting when classifying text with Multinomial Naive Bayes

Time: 2018-05-21 05:23:36

Tags: python numpy scikit-learn

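I have finished all the preprocessing tasks, such as removing stop words and HTML tags. I am trying to classify the IMDB movie dataset (Stanford University's Large Movie Review Dataset) using Multinomial Naive Bayes. I am getting an error on the variable X. I have made it a 2D array, but I don't know how to handle the error.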

This is the Multinomial Naive Bayes part of the code.

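    import numpy as np
    import sklearn.datasets
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    categories = ['pos', 'neg']
    doc_to_train = sklearn.datasets.load_files("/home/satyam/aclImdb_v1/aclImdb/train", description=None, categories=categories, load_content=True, encoding='utf-8', shuffle=True, random_state=42)
    vectorizer = CountVectorizer()
    # tokens: the preprocessed review texts, built earlier in the script (not shown)
    X = (vectorizer.fit_transform(tokens).toarray())  # line 92: the MemoryError is raised here
    analyze = vectorizer.build_analyzer()
    vect = vectorizer.get_feature_names()
    y = np.array(doc_to_train.target)
    # fit_transform already returns a 2-D (n_samples, n_features) matrix, so the original
    # X.reshape() (which fails without a shape argument) and X.transpose() are not needed
    print(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    # Keep the fitted model; chaining .predict() here would store an array in mnb instead
    mnb = MultinomialNB().fit(X_train, y_train)
    print("MNB %s" % mnb)
    print("Prediction %s" % mnb.predict(X_test))
    accuracy = mnb.score(X_test, y_test)
    print("Accuracy %s" % accuracy)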

The error I get is:

Traceback (most recent call last):
  File "sentiment_analysis_NB.py", line 92, in <module>
    X = (vectorizer.fit_transform(tokens).toarray())
  File "/usr/lib/python3.6/site-packages/scipy/sparse/compressed.py", line 943, in toarray
    out = self._process_toarray_args(order, out)
  File "/usr/lib/python3.6/site-packages/scipy/sparse/base.py", line 1130, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
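
The traceback shows the failure inside `toarray()`, at the `np.zeros(self.shape, ...)` call that allocates a dense array with one cell per (document, term) pair. As a rough illustration only: the training split has 25,000 reviews, so if the vocabulary had about 75,000 terms (an assumed figure, not taken from the post), a dense int64 array would need about 25,000 × 75,000 × 8 bytes ≈ 15 GB. The sketch below is one possible way around this, not a confirmed fix for the original script: it leaves the output of `fit_transform` as a SciPy sparse matrix, which `train_test_split` and `MultinomialNB` both accept. The toy `docs` and `labels` are hypothetical stand-ins for the preprocessed reviews and their targets.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the preprocessed IMDB reviews.
docs = [
    "great movie loved it", "wonderful acting great plot",
    "loved the wonderful story", "great fun loved it",
    "terrible movie hated it", "awful acting boring plot",
    "hated the boring story", "awful waste terrible film",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = pos, 0 = neg

vectorizer = CountVectorizer()
# No .toarray(): X stays a sparse matrix of shape (n_docs, n_terms).
X = vectorizer.fit_transform(docs)

# Bytes a dense copy would need: rows * columns * bytes per element.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize
# Bytes held by the sparse (CSR) representation: only non-zero counts are stored.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print("shape:", X.shape, "dense bytes:", dense_bytes, "sparse bytes:", sparse_bytes)

# Splitting and fitting both work directly on the sparse matrix.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels)

mnb = MultinomialNB().fit(X_train, y_train)  # keep the fitted model
predictions = mnb.predict(X_test)
accuracy = mnb.score(X_test, y_test)
print("Prediction %s" % predictions)
print("Accuracy %.2f" % accuracy)
```

On a corpus this small the sparse form saves little or nothing; the saving matters when most entries of the count matrix are zero, as is typical for bag-of-words features over a large vocabulary.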

0 answers:

No answers