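I am testing a multi-label classification problem using text features. I have 1503 text documents in total. Each time I run the script manually, my model reports slightly different results. As a beginner, I am not sure whether my model is overfitting or whether this is normal.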
I built the model using the exact script from the following blog post. The one change is that I use LinearSVC from scikit-learn.
http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html
My accuracy score varies between 89% and 90%, and kappa between 87% and 88%. What should I change to make the results stable?
Here is the output of two manual runs:
First run
Total emails classified: 1503
F1 Score: 0.902158940397
classification accuracy: 0.902158940397
kappa accuracy: 0.883691169128
              precision    recall  f1-score   support
        Arts      0.916     0.878     0.897       237
       Music      0.932     0.916     0.924       238
        News      0.828     0.876     0.851       242
    Politics      0.937     0.900     0.918       230
     Science      0.932     0.791     0.855        86
      Sports      0.929     0.948     0.938       233
  Technology      0.874     0.937     0.904       237
 avg / total      0.904     0.902     0.902      1503
Second run
Total emails classified: 1503
F1 Score: 0.898181015453
classification accuracy: 0.898181015453
kappa accuracy: 0.879002051427
Here is the code:
def compute_classification():
    #- 1. Load dataset
    data = DataFrame({'text': [], 'class': []})
    for path, classification in SOURCES:
        data = data.append(build_data_frame(path, classification))
    data = data.reindex(numpy.random.permutation(data.index))
    #- 2. Apply different classification methods
    """
    SVM
    """
    pipeline = Pipeline([
        # SVM using TfidfVectorizer
        ('vectorizer', TfidfVectorizer(max_features = 25000, ngram_range=(1,2), sublinear_tf=True, max_df=0.95, min_df=2,stop_words=stop_words1)),
        ('clf', LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
    ])
    #- 3. Perform K Fold Cross Validation
    k_fold = KFold(n=len(data), n_folds=10)
    f_score = []
    c_accuracy = []
    k_score = []
    confusion = numpy.array([[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0]])
    y_predicted_overall = None
    y_test_overall = None
    for train_indices, test_indices in k_fold:
        train_text = data.iloc[train_indices]['text'].values
        train_y = data.iloc[train_indices]['class'].values.astype(str)
        test_text = data.iloc[test_indices]['text'].values
        test_y = data.iloc[test_indices]['class'].values.astype(str)
        # Train the model
        pipeline.fit(train_text, train_y)
        # Predict test data
        predictions = pipeline.predict(test_text)
        confusion += confusion_matrix(test_y, predictions, binary=False)
        score = f1_score(test_y, predictions, average='micro')
        f_score.append(score)
        caccuracy = metrics.accuracy_score(test_y, predictions)
        c_accuracy.append(caccuracy)
        kappa = cohen_kappa_score(test_y, predictions)
        k_score.append(kappa)
        # collect the y_predicted per fold
        if y_predicted_overall is None:
            y_predicted_overall = predictions
            y_test_overall = test_y
        else:
            y_predicted_overall = numpy.concatenate([y_predicted_overall, predictions])
            y_test_overall = numpy.concatenate([y_test_overall, test_y])
    # Print Metrics
    print_metrics(data,k_score,c_accuracy,y_predicted_overall,y_test_overall,f_score,confusion)
    return pipeline
Answer 0 (score: 2)
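You are seeing variation because LinearSVC uses a random number generator when fitting. From its documentation:
"The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter."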
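You can also try setting the random_state parameter. In fact, most sklearn objects that use a random number generator accept random_state as an optional parameter. You can pass either a RandomState instance or an int seed: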
pipeline = Pipeline([
    # SVM using TfidfVectorizer
    ('vectorizer', TfidfVectorizer(max_features = 25000, ngram_range=(1,2), sublinear_tf=True, max_df=0.95, min_df=2,stop_words=stop_words1)),
    ('clf', LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-5, random_state=42))
])
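One way to check whether a fixed seed removes the variation is to fit the same estimator twice and compare the learned weights. The sketch below is only an illustration on synthetic data from make_classification (an assumption, not the question's corpus). It uses dual=True because, according to the current scikit-learn documentation, random_state only affects LinearSVC when the dual solver is used; with dual=False, as in the pipeline above, the solver is described as non-random, so the seed may make no difference there.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features (assumption: not the real data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

def fit_coef(seed):
    """Fit a LinearSVC with the given seed and return its weight matrix."""
    clf = LinearSVC(dual=True, tol=1e-5, max_iter=10000, random_state=seed)
    return clf.fit(X, y).coef_

# Two fits with the same seed on the same data should give the same weights.
same_seed_equal = np.allclose(fit_coef(42), fit_coef(42))
print("same weights with the same seed:", same_seed_equal)
```

If the weights match here but the cross-validation scores still change between runs, the remaining randomness probably comes from somewhere other than the classifier.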
Edit: as noted in the comments, cross_validation.KFold also takes a random_state parameter that determines how the data is split. To ensure reproducibility, you should also pass a seed or a RandomState to KFold.
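On second thought: the documentation for KFold suggests that the splits are not randomized by default unless shuffle=True is also specified, so I am not sure whether the suggestion above will help.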
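The point about shuffle can be checked directly. The sketch below uses toy indices rather than the question's data and shows that KFold without shuffle yields the same folds on every call. That suggests the run-to-run variation in the question more likely enters through the unseeded numpy.random.permutation that reorders the rows before cross-validation. Seeding that permutation through a numpy.random.RandomState, shown here as one possible remedy rather than the only one, makes the row order repeatable.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20  # toy size, chosen only for illustration

# Without shuffle=True the test folds are contiguous blocks and never change.
folds_a = [test for _, test in KFold(n_splits=5).split(np.arange(n_samples))]
folds_b = [test for _, test in KFold(n_splits=5).split(np.arange(n_samples))]
splits_identical = all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))

# A seeded permutation produces the same row order on every run.
order_1 = np.random.RandomState(42).permutation(n_samples)
order_2 = np.random.RandomState(42).permutation(n_samples)
order_identical = np.array_equal(order_1, order_2)

print(splits_identical, order_identical)
```

In the question's code this would correspond to calling permutation(data.index) on a seeded RandomState instead of calling numpy.random.permutation(data.index) directly.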
As a side note: cross_validation.KFold has been deprecated since version 0.18, so I recommend using model_selection.KFold instead:
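from sklearn.model_selection import KFold

# random_state only takes effect together with shuffle=True;
# recent scikit-learn versions raise an error if it is set without shuffling
k_fold = KFold(n_splits=10, shuffle=True, random_state=42)
...
for train_indices, test_indices in k_fold.split(data):
    ...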
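With both the splitter and the classifier seeded, a whole cross-validation run can be made repeatable. The following is a minimal sketch under stated assumptions: the twelve-sentence corpus and its labels are invented for illustration, the vectorizer uses default settings instead of the question's parameters, and cross_val_score stands in for the hand-written fold loop.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus (assumption: stands in for the 1503 real documents).
sports = ["the team won the match", "the striker scored a goal",
          "the coach praised the players", "fans cheered at the stadium",
          "the match ended in a draw", "the goalkeeper saved a penalty"]
music = ["the band released an album", "the singer performed a song",
         "the guitarist played a solo", "fans sang along at the concert",
         "the album topped the charts", "the orchestra played a symphony"]
texts = sports + music
labels = ["Sports"] * len(sports) + ["Music"] * len(music)

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=42)),
])

def run_cv():
    """Run 3-fold cross-validation with a seeded, shuffled splitter."""
    cv = KFold(n_splits=3, shuffle=True, random_state=42)
    return cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

scores_1 = run_cv()
scores_2 = run_cv()
print("repeatable:", np.array_equal(scores_1, scores_2))
print("mean accuracy: %.3f, std across folds: %.3f"
      % (scores_1.mean(), scores_1.std()))
```

Comparing the standard deviation across folds with the difference between two unseeded runs is a simple way to judge whether a change of well under one percentage point, as in the question, is larger than ordinary fold-to-fold noise.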