I am using scikit-learn's (sklearn) linear SVM (LinearSVC) for sentiment analysis over 3 classes (positive, negative and neutral), and I am trying to remove the 10% most predictive features to see whether I can prevent overfitting when doing domain adaptation. I know the feature weights can be accessed via svm.LinearSVC().coef_, but I don't know how to remove the 10% most predictive features. Does anyone know how to proceed? Thanks in advance for your help. Here is my code:
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer as cv
# Using linear SVM classifier
clf = svm.LinearSVC()
# Count vectorizer used by the SVM classifier (select which ngrams to use here)
vec = cv(lowercase=True, ngram_range=(1,2))
# Fit count vectorizer with training text data
vec.fit(trainStringList)
# X represents the text data from the respective datasets
# Transforms text into vectors for the training set
X_train = vec.transform(trainStringList)
# Transforms text into vectors for the test set
X_test = vec.transform(testStringList)
# Y represents the labels from the respective datasets
# Converting labels from the respective data sets to integers (0="positive", 1= "neutral", 2= "negative")
Y_train = trainLabels
Y_test = testLabels
# Fitting the training data to the linear SVM classifier
clf.fit(X_train,Y_train)
for feature_vector in clf.coef_:
    ???
Answer 0 (score: 0)
The coefficients with the largest weights are the ones that matter most when the model generates predictions, so you could drop the features associated with them (one way to do this is sketched below). I would not recommend doing this, though.
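For completeness, here is a minimal sketch of how you could remove those features, assuming "most predictive" means largest absolute coefficient across the three classes; it reuses clf, X_train, X_test and Y_train from your snippet, and the 10% threshold and variable names are just illustrative:
import numpy as np
# clf.coef_ has shape (n_classes, n_features) for a 3-class LinearSVC;
# score each feature by its largest absolute weight across the classes.
importance = np.abs(clf.coef_).max(axis=0)
# Indices of the 10% most predictive features, and of the features to keep.
n_features = importance.shape[0]
n_drop = int(0.10 * n_features)
drop_idx = np.argsort(importance)[-n_drop:]
keep_idx = np.setdiff1d(np.arange(n_features), drop_idx)
# Select the remaining columns and refit the classifier on the reduced data.
X_train_reduced = X_train[:, keep_idx]
X_test_reduced = X_test[:, keep_idx]
clf_reduced = svm.LinearSVC()
clf_reduced.fit(X_train_reduced, Y_train)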
If your goal is to reduce overfitting, C is the regularization parameter of this model. Note that in LinearSVC the regularization strength is inversely proportional to C, so to regularize more strongly you would pass a smaller C value than the default of 1 when constructing the LinearSVC object:
clf = svm.LinearSVC(C=0.1)
You should use some form of cross-validation to determine the best values of hyperparameters such as C.
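For example, a minimal sketch using GridSearchCV over your training data; the grid of C values and the 5-fold setting are only illustrative:
from sklearn.model_selection import GridSearchCV
# Search a small grid of C values with 5-fold cross-validation.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(svm.LinearSVC(), param_grid, cv=5)
search.fit(X_train, Y_train)
print(search.best_params_)      # best C found on the training folds
clf = search.best_estimator_    # LinearSVC refit on all training data with that C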