OverflowError: signed integer is greater than maximum in scikit-learn (Python)

Date: 2016-03-13 12:54:54

Tags: python python-2.7 scikit-learn text-mining sentiment-analysis

I am working on sentiment analysis of about 30,000 tweets. The Python version is 2.7, on Linux. In the training phase I use nltk's SklearnClassifier as a wrapper around scikit-learn to apply different classifiers such as Naive Bayes, LinearSVC, Logistic Regression, etc.

It worked fine when the number of tweets was 10,000, but now with 30,000 tweets I get an error while classifying bigrams with Multinomial Naive Bayes from sklearn. Here is the relevant part of the implementation, after preprocessing and splitting into training and testing sets:


import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

training_set = nltk.classify.util.apply_features(extractFeatures, trainTweets)
testing_set = nltk.classify.util.apply_features(extractFeatures, testTweets)

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
MNBAccuracy = nltk.classify.accuracy(MNB_classifier, testing_set) * 100
print "-------- MultinomialNB --------"
print "RESULT : Matches  " + str(int((testSize * MNBAccuracy) / 100)) + ":" + str(testSize)
print "MNB accuracy percentage:" + str(MNBAccuracy)
print ""

Here is the error:

Traceback (most recent call last):
  File "/home/sb402747/Desktop/Sentiment/sentiment140API/analysing/Classifier.py", line 83, in <module>
    MNB_classifier.train(training_set)
  File "/home/sb402747/.local/lib/python2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
    X = self._vectorizer.fit_transform(X)
  File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
    return self._transform(X, fitting=True)
  File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 176, in _transform
    indptr.append(len(indices))
OverflowError: signed integer is greater than maximum

I guess the reason is that the number of indices in the array exceeds the maximum allowed. I even tried changing the index dtype in dict_vectorizer.py from i to l, but that did not solve my problem; instead I received this error:


Traceback (most recent call last):
  File "/home/sb402747/Desktop/Sentiment/ServerBackup26-02-2016/analysing/Classifier.py", line 84, in <module>
    MNB_classifier.train(training_set)
  File "/home/sb402747/.local/lib/python2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
    X = self._vectorizer.fit_transform(X)
  File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
    return self._transform(X, fitting=True)
  File "/home/sb402747/.local/lib/python2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 186, in _transform
    shape=shape, dtype=dtype)
  File "/rwthfs/rz/SW/UTIL.common/Python/2.7.9/x86_64/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 88, in __init__
    self.check_format(full_check=False)
  File "/rwthfs/rz/SW/UTIL.common/Python/2.7.9/x86_64/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 167, in check_format
    raise ValueError("indices and data should have the same size")
ValueError: indices and data should have the same size

I then discarded that change and reverted the dtype back to i. How can I solve this problem?
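For context on the dtype the question is editing: the OverflowError fires when the number of sparse-matrix indices no longer fits in a signed 32-bit integer, which is what the array typecode "i" usually maps to. A tiny sketch of the limits involved:

```python
# The "i" typecode the question mentions is a C signed int,
# typically 4 bytes; its maximum value is 2**31 - 1.
import array

int32_max = 2 ** 31 - 1
print(int32_max)  # 2147483647

# Item size of the "i" typecode on this platform (typically 4 bytes):
print(array.array("i").itemsize)
```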

1 Answer:

Answer 0 (score: 0)

Well, it seems that here:

File "/home/sb402747/.local/lib/python2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)

nltk requests a matrix that is too large. Maybe you can change that somehow, for example by minimizing the number of features (words) in your texts, or by requesting this result in two passes?

Also, have you tried doing this with the latest stable versions of numpy / scipy / scikit-learn?
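A quick way to check which versions are installed (a hedged helper sketch; stack_versions is a made-up name, and missing packages are reported as None rather than raising):

```python
# Report installed versions of the scientific stack; newer scipy
# releases handle large sparse-matrix indices with 64-bit integers.
import importlib

def stack_versions(names=("numpy", "scipy", "sklearn")):
    versions = {}
    for name in names:
        try:
            versions[name] = importlib.import_module(name).__version__
        except ImportError:
            versions[name] = None  # not installed in this environment
    return versions

print(stack_versions())
```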

Also read: https://sourceforge.net/p/scikit-learn/mailman/message/31340515/