sklearn cross-validation score gives the same result for every number of folds

Time: 2017-09-05 16:55:30

Tags: python-3.x machine-learning scikit-learn

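I can't figure out why cross-validation always gives me the same accuracy (0.92), no matter how many folds I use.

Even if I remove the parameter cv=10, I get the same result.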

import ast

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# read preprocessed data
with open('pretprocesirano.txt') as f:
    traindata = ast.literal_eval(f.read())
with open('pretprocesiranoTEST.txt') as f:
    testdata = ast.literal_eval(f.read())

# create word vectors
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), min_df=3, max_features=300)
traindataCV = vectorizer.fit_transform(traindata)

# save word list
wordlist = vectorizer.vocabulary_

# rebuild a vectorizer from the saved vocabulary
SavedVectorizer = CountVectorizer(vocabulary=wordlist)

# transform test data
testdataCV = SavedVectorizer.transform(testdata)

# modeling: Naive Bayes (label_train is the list of training labels)
clf = MultinomialNB()
clf.fit(traindataCV, label_train)

# cross-validation score
CrossValScore = cross_val_score(clf, traindataCV, label_train, cv=10)
print("Accuracy CrossValScore: %0.3f" % CrossValScore.mean())

I also tried it this way and got the same result (0.92). This happens even if I change the number of folds or remove the parameter.

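from sklearn.model_selection import KFold
# random_state only applies when shuffle=True (recent scikit-learn versions raise an error otherwise), so it is omitted
CrossValScore = cross_val_score(clf, traindataCV, label_train, cv=KFold(n_splits=10, shuffle=False))
print("Accuracy CrossValScore: %0.3f" % CrossValScore.mean())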
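A minimal sketch for checking whether the fold count is actually taking effect: print the whole array returned by cross_val_score rather than only its mean. Its length should equal the number of folds, and omitting cv falls back to the library default (5 folds in recent scikit-learn versions, 3 in older ones). The data here is synthetic and only stands in for the CountVectorizer output; the names X, y, results and default_scores are assumptions made for illustration, not part of the original code.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative count features (assumption: stand-in for traindataCV)
rng = np.random.RandomState(0)
X = rng.poisson(lam=1.0, size=(200, 20))
y = rng.randint(0, 2, size=200)
X[:, 0] += 3 * y  # make the first feature informative about the label

# One score per fold: the array length should match the requested fold count
results = {k: cross_val_score(MultinomialNB(), X, y, cv=k) for k in (3, 5, 10)}
for k, scores in results.items():
    print("cv=%d -> %d scores, mean %0.3f" % (k, len(scores), scores.mean()))
    print(scores.round(3))

# Without cv, scikit-learn uses its default number of folds
default_scores = cross_val_score(MultinomialNB(), X, y)
print("default -> %d scores, mean %0.3f" % (len(default_scores), default_scores.mean()))
```

If the per-fold scores differ while the rounded means agree, the folds are being applied and the mean is simply stable across fold counts.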

Here are some examples:

traindata= ['ucg investment bank studying unicredit intesa paschi merger sole', 'mtoken sredstva autentifikacije intesa line umesto mini cda cega line vise moze koristi aktivacija', 'pll intesa', 'intesa and unicredit banka asset management the leading italia lenders are both after more fee income but url', 'about write intesa scene colbie cailat fosterthepeople that involves sexy taj between these url']

testdata= ['naumovic samo privilegovani nije delatnosti moci imati hit nama traziti depozit rimuje mentionpositive', 'breaking unicredit board okays launch bad loans vehicle with intesa kkr read more url', 'postoji promocija kupovina telefon rate telefon banka popust pretplata url', 'direktor politike haha struja obecao stan svi zaposliti kredit komercijalna banka', 'forex update unicredit and intesa pool bln euros bad loans kkr vehicle url']

label_train = [0, 1, 0, 0, 0]
label_test = [1, 0, 1, 1, 0]
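The sample labels above are mostly 0, so one sanity check is the share of the majority class, which is the accuracy a classifier that always predicts that class would reach on these labels. This is only a sketch on the five sample labels; whether class imbalance has anything to do with the constant 0.92 is an assumption that would have to be checked against the full label_train.

```python
from collections import Counter

# The five sample training labels from the question
label_train = [0, 1, 0, 0, 0]

# Count how often each class occurs
counts = Counter(label_train)

# Accuracy of always predicting the most frequent class
majority_share = max(counts.values()) / len(label_train)
print(counts)
print("Majority-class share: %0.3f" % majority_share)
```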

0 answers:

No answers