I built the model using the exact script from the following blog. The one variation is that I use LinearSVC from scikit-learn.



Total emails classified: 1503
F1 Score: 0.902158940397
classification accuracy: 0.902158940397
kappa accuracy: 0.883691169128

             precision    recall  f1-score   support

      Arts      0.916     0.878     0.897       237
     Music      0.932     0.916     0.924       238
      News      0.828     0.876     0.851       242
  Politics      0.937     0.900     0.918       230
   Science      0.932     0.791     0.855        86
    Sports      0.929     0.948     0.938       233
Technology      0.874     0.937     0.904       237

avg / total     0.904     0.902     0.902      1503

Second run
Total emails classified: 1503
F1 Score: 0.898181015453
classification accuracy: 0.898181015453
kappa accuracy: 0.879002051427


def compute_classification():

    #- 1. Load dataset
    data = DataFrame({'text': [], 'class': []})
    for path, classification in SOURCES:
        data = data.append(build_data_frame(path, classification))
    # Shuffle the rows (unseeded, so the order differs on every run)
    data = data.reindex(numpy.random.permutation(data.index))

    #- 2. Apply different classification methods
    pipeline = Pipeline([
        # SVM using TfidfVectorizer
        ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(1, 2), sublinear_tf=True, max_df=0.95, min_df=2, stop_words=stop_words1)),
        ('clf',        LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
    ])

    #- 3. Perform K Fold Cross Validation
    k_fold = KFold(n=len(data), n_folds=10)
    f_score    = []
    c_accuracy = []
    k_score    = []
    confusion  = numpy.zeros((7, 7), dtype=int)  # one row and column per class
    y_predicted_overall = None
    y_test_overall      = None

    for train_indices, test_indices in k_fold:

        train_text = data.iloc[train_indices]['text'].values
        train_y    = data.iloc[train_indices]['class'].values.astype(str)
        test_text  = data.iloc[test_indices]['text'].values
        test_y     = data.iloc[test_indices]['class'].values.astype(str)

        # Train the model
        pipeline.fit(train_text, train_y)

        # Predict test data
        predictions = pipeline.predict(test_text)

        # Accumulate the per-fold metrics
        confusion += confusion_matrix(test_y, predictions)
        f_score.append(f1_score(test_y, predictions, average='micro'))
        c_accuracy.append(metrics.accuracy_score(test_y, predictions))
        k_score.append(cohen_kappa_score(test_y, predictions))

        # Collect the predictions and true labels per fold
        if y_predicted_overall is None:
            y_predicted_overall = predictions
            y_test_overall = test_y
        else:
            y_predicted_overall = numpy.concatenate([y_predicted_overall, predictions])
            y_test_overall = numpy.concatenate([y_test_overall, test_y])

    # Print Metrics

    return pipeline

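You are seeing variation because LinearSVC uses a random number generator when fitting. From the documentation:

    The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.

Passing a fixed random_state, together with a smaller tol, makes the fit repeatable: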


pipeline = Pipeline([
    # SVM using TfidfVectorizer
    ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(1, 2), sublinear_tf=True, max_df=0.95, min_df=2, stop_words=stop_words1)),
    # Smaller tol and a fixed seed for repeatable fits
    ('clf',        LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-5, random_state=42))
])
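To check the effect of the seed in isolation, here is a minimal sketch that fits LinearSVC twice with the same random_state and compares the learned coefficients. The synthetic data from make_classification and the helper fit_coef are assumptions for illustration only, not part of the original script. As far as I can tell from the scikit-learn documentation, random_state drives the sample shuffling of the dual solver, so the sketch uses dual=True; with dual=False, as in the question, the solver itself may not be the source of randomness at all.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in data (hypothetical, not the email corpus).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)


def fit_coef(seed):
    """Fit a LinearSVC with the given seed and return its coefficient matrix."""
    # dual=True selects the solver that shuffles samples internally,
    # which is where random_state comes into play.
    clf = LinearSVC(dual=True, tol=1e-5, max_iter=5000, random_state=seed)
    return clf.fit(X, y).coef_


# Two fits with the same seed on the same data should agree exactly.
same_seed = np.array_equal(fit_coef(42), fit_coef(42))
print("identical coefficients with the same seed:", same_seed)
```

If the coefficients match but the cross-validation scores still move between runs, the remaining variation is coming from how the data is shuffled and split rather than from the classifier.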


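EDIT: As noted in the comments, cross_validation.KFold also takes a random_state parameter, which determines how the data is split into folds. It only has an effect when shuffle=True. To ensure reproducibility, you should also pass a seed or a RandomState to KFold.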


As a side note: cross_validation.KFold has been deprecated since version 0.18 (and was removed in 0.20), so I recommend using model_selection.KFold instead.

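from sklearn.model_selection import KFold
# random_state is only used when shuffle=True; recent versions raise an error otherwise
k_fold = KFold(n_splits=10, shuffle=True, random_state=42)
for train_indices, test_indices in k_fold.split(data):
    ...  # same loop body as before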
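Putting the pieces together, the following is a rough sketch of a fully seeded cross-validation loop using the model_selection API. The toy corpus, the labels, and the run_cv helper are hypothetical stand-ins for the real emails and script. Note that the question's script also shuffles the rows with an unseeded numpy.random.permutation, which by itself changes fold membership between runs, so the sketch seeds that step as well.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for the real emails (hypothetical data).
texts = np.array([
    "goal match team win", "team score goal league",
    "match league cup win", "cup final team goal",
    "cpu chip code software", "software code bug release",
    "chip release cpu laptop", "laptop software bug code",
] * 5)
labels = np.array((["Sports"] * 4 + ["Technology"] * 4) * 5)


def run_cv(seed):
    """Run 5-fold cross-validation with every source of randomness seeded."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(texts))  # seeded row shuffle
    X, y = texts[order], labels[order]

    pipeline = Pipeline([
        ("vectorizer", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
        ("clf", LinearSVC(tol=1e-5, random_state=seed)),
    ])

    k_fold = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in k_fold.split(X):
        pipeline.fit(X[train_idx], y[train_idx])
        predictions = pipeline.predict(X[test_idx])
        scores.append(accuracy_score(y[test_idx], predictions))
    return scores


first = run_cv(42)
second = run_cv(42)
print("runs agree:", first == second)
```

Calling run_cv with a different seed gives a different but equally repeatable split, which is a convenient way to see how much of the score spread comes from fold assignment alone.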