I am working on sentiment analysis of about 30,000 tweets; the Python version on Linux is 2.7. In the training phase I use nltk's SklearnClassifier as a wrapper around the sklearn library to apply different classifiers such as Naive Bayes, LinearSVC, Logistic Regression, etc. It worked fine when the number of tweets was 10,000, but now, with 30,000 tweets, I receive an error when classifying bigrams with Multinomial Naive Bayes in sklearn. Here is part of the implementation code, after preprocessing and splitting into training and test sets:
import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
training_set = nltk.classify.util.apply_features(extractFeatures, trainTweets)
testing_set = nltk.classify.util.apply_features(extractFeatures, testTweets)
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
MNBAccuracy = nltk.classify.accuracy(MNB_classifier, testing_set)*100
print "-------- MultinomialNB --------"
print "RESULT : Matches " + str(int((testSize*MNBAccuracy)/100)) + ":"+ str(testSize)
print "MNB accuracy percentage:" + str(MNBAccuracy)
print ""
Here is the error:

Traceback (most recent call last):
File "/home/sb402747/Desktop/Sentiment/sentiment140API/analysing/Classifier.py", line 83, in <module>
MNB_classifier.train(training_set)
File "/home/sb402747/.local/lib/python2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)
File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
return self._transform(X, fitting=True)
File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 176, in _transform
indptr.append(len(indices))
OverflowError: signed integer is greater than maximum

I guess the reason is that the number of indices in the array exceeds the maximum allowed for the i (signed 32-bit integer) index type. I even tried changing the type of the indices in dict_vectorizer.py from i to l, but that did not solve my problem, and I received this error:
Traceback (most recent call last):
File "/home/sb402747/Desktop/Sentiment/ServerBackup26-02-2016/analysing/Classifier.py", line 84, in <module>
MNB_classifier.train(training_set)
File "/home/sb402747/.local/lib/python2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)
File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
return self._transform(X, fitting=True)
File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 186, in _transform
shape=shape, dtype=dtype)
File "/rwthfs/rz/SW/UTIL.common/Python/2.7.9/x86_64/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 88, in __init__
self.check_format(full_check=False)
File "/rwthfs/rz/SW/UTIL.common/Python/2.7.9/x86_64/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 167, in check_format
raise ValueError("indices and data should have the same size")
ValueError: indices and data should have the same size

Then I discarded that change and reverted the index type back to i. How can I solve this problem?
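For context on the first error: DictVectorizer in this version stores its sparse-matrix indices with the array typecode i, a signed 32-bit C int on typical platforms, which caps out at 2**31 - 1. A minimal illustration of that limit using only the standard library (assuming a platform where a C int is 32 bits):

```python
import array

# The "i" typecode is a signed 32-bit C int on typical platforms.
INT32_MAX = 2**31 - 1

a = array.array('i', [INT32_MAX])  # the maximum value still fits
try:
    a.append(INT32_MAX + 1)        # one past the limit
except OverflowError as e:
    print("OverflowError:", e)     # same error class as in the traceback
```

With tens of millions of (feature, tweet) pairs, the vectorizer's index counter can cross this same boundary.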
Answer 0 (score: 0)

Look at this part of your traceback:
File "/home/sb402747/.local/lib/python2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)
So nltk is requesting a matrix that is too large. Maybe you can change that somehow, for example by minimizing the number of features (words) in your texts, or by requesting this result in two passes?
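As a sketch of the feature-minimization idea, assuming the tweets are already tokenized (build_top_features and extract_features below are hypothetical helpers, not the question's own extractFeatures): keep only the top-N most frequent tokens, so DictVectorizer allocates far fewer columns.

```python
from collections import Counter

def build_top_features(tokenized_tweets, top_n=5000):
    # Count every token across the corpus and keep the N most frequent.
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return set(tok for tok, _ in counts.most_common(top_n))

def extract_features(tweet_tokens, allowed):
    # Boolean bag-of-words restricted to the allowed vocabulary.
    return {tok: True for tok in tweet_tokens if tok in allowed}

tweets = [["good", "movie"], ["bad", "movie"], ["good", "plot", "movie"]]
allowed = build_top_features(tweets, top_n=2)
print(extract_features(tweets[2], allowed))  # -> {'good': True, 'movie': True}
```

The same vocabulary must be used for both the training and the test set, so build it once from the training tweets only.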
Also, have you tried doing this on the latest stable versions of numpy/scipy/scikit-learn?
Also read: https://sourceforge.net/p/scikit-learn/mailman/message/31340515/