How to remove the 10% most predictive features from sklearn's linear SVM

Asked: 2014-08-22 21:32:34

Tags: python python-2.7 scikit-learn svm sentiment-analysis

I'm using scikit-learn's (sklearn) linear SVM (LinearSVC) for sentiment analysis over 3 classes (positive, negative, and neutral), and I'm trying to remove the 10% most predictive features to see whether I can prevent overfitting while doing domain adaptation. I know the feature weights can be accessed through svm.LinearSVC().coef_, but I don't know how to remove the 10% most predictive features from there. Does anyone know how to proceed? Thanks in advance for your help. Here is my code:

from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer as cv

# Using linear SVM classifier
clf = svm.LinearSVC()
# Count vectorizer used by the SVM classifier (select which ngrams to use here)
vec = cv(lowercase=True, ngram_range=(1,2)) 
# Fit count vectorizer with training text data
vec.fit(trainStringList) 
# X represents the text data from the respective datasets
# Transforms text into vectors for the training set
X_train = vec.transform(trainStringList) 
#transforms text into vectors for the test set
X_test = vec.transform(testStringList)   
# Y represents the labels from the respective datasets
# Converting labels from the respective data sets to integers (0="positive", 1= "neutral", 2= "negative")
Y_train = trainLabels 
Y_test = testLabels
# Fitting the training data to the linear SVM classifier 
clf.fit(X_train,Y_train)

for feature_vector in clf.coef_:
    ???
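For reference, a minimal sketch of one possible way to do this (not part of the original question, and assuming the code above has already been run): rank the features by their largest absolute coefficient across the three one-vs-rest classifiers and drop the columns corresponding to the top 10%.

import numpy as np

# clf.coef_ has shape (3, n_features); take each feature's largest absolute weight
importance = np.abs(clf.coef_).max(axis=0)
# Indices of the 10% most predictive features
n_drop = int(0.1 * importance.shape[0])
top_idx = np.argsort(importance)[-n_drop:]
# Boolean mask keeping the remaining 90% of the columns
keep_mask = np.ones(importance.shape[0], dtype=bool)
keep_mask[top_idx] = False
# Column selection works on the sparse matrices produced by CountVectorizer
X_train_reduced = X_train[:, keep_mask]
X_test_reduced = X_test[:, keep_mask]

The classifier would then need to be refit on X_train_reduced before evaluating on X_test_reduced.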

1 Answer:

Answer 0 (score: 0)

The coefficients with the largest weights indicate the features that matter most when generating predictions. You could remove the features associated with those coefficients, but I wouldn't recommend it. If your goal is to reduce overfitting, C is the regularization parameter of this model; in scikit-learn a smaller C means stronger regularization, so pass a lower C value when constructing the LinearSVC object (the default is 1):

clf = svm.LinearSVC(C=0.1)

You should also use some form of cross-validation to determine the best values for hyperparameters such as C.
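A sketch of what that search might look like with GridSearchCV (assuming a recent scikit-learn where it lives in sklearn.model_selection, and reusing X_train and Y_train from the question):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Search over a small grid of regularization strengths with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(LinearSVC(), param_grid, cv=5)
search.fit(X_train, Y_train)
print(search.best_params_, search.best_score_)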