Training sklearn's SVC only ever yields 'positive' predictions

Time: 2015-03-13 23:03:40

Tags: python scikit-learn svm

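I am training an SVM on features obtained from a TfidfVectorizer. When I test the SVM by asking it for predictions, even feature vectors from training entries that are labeled 'negative' result in a 'positive' prediction. I have the feeling I am making some basic mistake, but I cannot work out from the documentation what it is.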

The code is more or less this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

data = load_data()                                  # a list of tuples, at position 0 is some text, at position 1 a label -- either 'positive' or 'negative'. Order is randomized.
vocab = {ch for entry in data for ch in entry[0]}   # the vocabulary


extractor = TfidfVectorizer(strip_accents='ascii', analyzer='char',
                            vocabulary=vocab, ngram_range=(1, 5), 
                            min_df=2, lowercase=False)
features, labels = extractor.fit_transform([entry[0] for entry in data]), \
        [entry[1] for entry in data]

clf = SVC()
clf.fit(features, labels)

for feature in features:
    print(clf.predict(feature))                      # testing on training data, half of the entries should be 'negative', but it always prints 'positive'

To give an impression of the data, here are two entries for each label:

  

positive
(0, 1)  0.15046358725
(0, 3)  0.431231348393
(0, 6)  0.126073691443
(0, 7)  0.053403320129
(0, 8)  0.172907188257
(0, 9)  0.176318488739
(0, 10)  0.0822510699681
(0, 11)  0.0750035434541
(0, 12)  0.245746908087
(0, 13)  0.070393261049
(0, 14)  0.217021712559
(0, 15)  0.0348732598324
(0, 17)  0.330453439288
(0, 18)  0.0801049801935
(0, 19)  0.121622267101
(0, 20)  0.155054690124
(0, 21)  0.105138945977
(0, 22)  0.104318311782
(0, 23)  0.142275533299
(0, 25)  0.114477206411
(0, 27)  0.160209505382
(0, 28)  0.129046778512
(0, 29)  0.0618410863719
(0, 30)  0.322325274638
(0, 31)  0.0341389957579
(0, 32)  0.310109380247
(0, 33)  0.112336563455
(0, 34)  0.0662718061209
(0, 35)  0.301680645638
(0, 36)  0.070241173501
(0, 37)  0.0490111226972
(0, 38)  0.0979593205615
(0, 39)  0.0596363664168

     

positive
(0, 1)  0.117625753539
(0, 3)  0.393303780468
(0, 6)  0.0919882279376
(0, 7)  0.146119207993
(0, 8)  0.13517116455
(0, 9)  0.402027406205
(0, 10)  0.150033882678
(0, 12)  0.0896532112974
(0, 13)  0.0642020479106
(0, 17)  0.263715785035
(0, 18)  0.0487064026643
(0, 19)  0.0443701485229
(0, 21)  0.0479458938703
(0, 22)  0.190286659581
(0, 23)  0.0865080972171
(0, 24)  0.0593888745322
(0, 25)  0.156613097216
(0, 27)  0.37573403916
(0, 28)  0.0941575069348
(0, 29)  0.112804104567
(0, 30)  0.0734940476429
(0, 32)  0.0404049579898
(0, 33)  0.0512281809479
(0, 34)  0.0604430823106
(0, 35)  0.432374318506
(0, 36)  0.128126673468
(0, 38)  0.238249661904
(0, 39)  0.0543912413894

     

negative
(0, 1)  0.0577944775799
(0, 3)  0.421629123125
(0, 6)  0.101694787822
(0, 7)  0.143589019178
(0, 8)  0.232453464603
(0, 9)  0.26666950341
(0, 10)  0.0368589769154
(0, 12)  0.165188968649
(0, 13)  0.0946354953804
(0, 14)  0.0364700345073
(0, 15)  0.0468830136663
(0, 17)  0.416489951344
(0, 18)  0.0717945095888
(0, 19)  0.098104136602
(0, 20)  0.0893369449956
(0, 21)  0.106010249646
(0, 22)  0.105182814722
(0, 23)  0.127515194624
(0, 24)  0.175081504231
(0, 25)  0.0384752992583
(0, 27)  0.123075952609
(0, 28)  0.20818593649
(0, 29)  0.0831380981142
(0, 30)  0.162498218497
(0, 32)  0.416905743313
(0, 33)  0.0755116766399
(0, 34)  0.0890946819135
(0, 35)  0.260726333632
(0, 36)  0.0629540209892
(0, 38)  0.0438982779995
(0, 39)  0.04008705517

     

negative
(0, 0)  0.140625053372
(0, 1)  0.185208158007
(0, 3)  0.434299020013
(0, 6)  0.124148980319
(0, 7)  0.0657350441611
(0, 8)  0.121619636365
(0, 9)  0.217033390727
(0, 10)  0.0337480747364
(0, 14)  0.100175877346
(0, 17)  0.406760691514
(0, 18)  0.131470088322
(0, 19)  0.179648270846
(0, 20)  0.136328159775
(0, 21)  0.0323543238518
(0, 22)  0.160508953733
(0, 23)  0.116752896535
(0, 24)  0.0400761509423
(0, 25)  0.105683937825
(0, 27)  0.140860408865
(0, 28)  0.0635383392931
(0, 29)  0.0761212324217
(0, 30)  0.297566697793
(0, 31)  0.0840445271321
(0, 32)  0.38171884737
(0, 33)  0.172846204642
(0, 34)  0.0815750798167
(0, 35)  0.212196443599
(0, 36)  0.11528138777
(0, 38)  0.0803865158889
(0, 39)  0.0734074055798

1 answer:

Answer 0 (score: 1)

You are using a kernel SVM with the RBF kernel without tuning gamma or C. That rarely works. Besides, RBF-kernel SVMs are not really a good match for text data. Try LinearSVC.
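As a rough illustration of that suggestion, here is a minimal sketch that swaps SVC for LinearSVC on character n-gram TF-IDF features. The toy `data` list is a made-up stand-in for `load_data()`, and the `ngram_range=(1, 3)` and `C=1.0` settings are assumptions chosen for the example, not values from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for load_data(): (text, label) tuples.
data = [
    ("great wonderful movie", "positive"),
    ("loved it, excellent", "positive"),
    ("fantastic and good", "positive"),
    ("really good, great fun", "positive"),
    ("terrible awful film", "negative"),
    ("hated it, horrible", "negative"),
    ("bad and boring", "negative"),
    ("really bad, awful mess", "negative"),
]
texts = [text for text, _ in data]
labels = [label for _, label in data]

# Character n-gram TF-IDF features feeding a linear SVM.
# No fixed vocabulary is passed, so the n-grams are learned from the texts.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3), lowercase=False),
    LinearSVC(C=1.0),  # C is an assumed default; tune it on real data
)
model.fit(texts, labels)

# Sanity check on the training data: both labels should now show up.
predictions = model.predict(texts)
for text, label, predicted in zip(texts, labels, predictions):
    print(f"{label:>8} -> {predicted:>8}  {text!r}")
```

If you do want to keep the RBF kernel, gamma and C would normally be chosen by cross-validated search (for instance with `GridSearchCV`) rather than left at their defaults.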