Selecting samples in Scikit-Learn

Asked: 2015-07-02 19:00:06

Tags: scikit-learn

Is there a way to automatically select training 'samples' from the data set to get a better model fit (DT or SVM)? I know about selecting 'features', but I am talking about selecting 'samples' after the features have been selected.

2 answers:

Answer 0 (score: 1):

There are generally two ways to select features: Univariate Feature Selection and L1-based Sparse Feature Selection.

from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
import numpy as np


# simulate some artificial data: 2000 observations, 1000-dimensional features,
# but only 2 out of the 1000 features are informative; the remaining 998 are noise
X, y = make_classification(n_samples=2000, n_features=1000, n_informative=2, random_state=0)
X.shape

Out[153]: (2000, 1000)

# Univariate Feature Selection: select 20 best from 1000 features
# ==========================================================================
# classification F-test
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_selected.shape
# or, to visualize the F-score/p-value of each of the 1000 features
X_f_scores, X_f_pval = f_classif(X, y)
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(X_f_scores)
ax.set_title('Univariate Feature Selection: Classification F-Score')
ax.set_xlabel('features')
ax.set_ylabel('F-score')
# which features are most important: top 10
np.argsort(X_f_scores)[-10:]  # argsort is from smallest to largest

Out[154]: array([940, 163, 574, 969, 994, 977, 360, 291, 838, 524])

[Plot: Univariate Feature Selection: Classification F-Score]
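A small addition beyond the original answer: SelectKBest also exposes get_support(), which reports which columns were kept, so you can recover the selected feature indices directly. A minimal, self-contained sketch:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=2000, n_features=1000, n_informative=2, random_state=0)

selector = SelectKBest(f_classif, k=20).fit(X, y)
# indices of the 20 selected features (should include the informative ones)
selected_idx = selector.get_support(indices=True)
print(selected_idx)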

# L1-based Sparse Feature Selection: any algorithm implementing the 'l1' penalty
# ==========================================================================
# use LinearSVC for example here
# other popular choices: logistic regression, Lasso (for regression)
feature_selector = LinearSVC(C=0.01, penalty='l1', dual=False)
feature_selector.fit(X, y)
# get features with non-zero coefficients: exactly 2
(feature_selector.coef_ != 0.0).sum()

Out[155]: 2

# LinearSVC.transform was removed in later scikit-learn releases;
# use SelectFromModel instead (or index the non-zero columns directly)
from sklearn.feature_selection import SelectFromModel
X_selected_l1 = SelectFromModel(feature_selector, prefit=True).transform(X)
# or equivalently: X[:, (feature_selector.coef_ != 0.0).ravel()]
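For completeness, here is a hedged sketch (my addition, not part of the original answer) that chains L1-based selection with a downstream classifier in a Pipeline, so the selection step is refit on each training set rather than on the full data:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=1000, n_informative=2, random_state=0)

# assumption: a second LinearSVC as the downstream model; any classifier would do
clf = Pipeline([
    ('select', SelectFromModel(LinearSVC(C=0.01, penalty='l1', dual=False))),
    ('svm', LinearSVC(dual=False)),
])
clf.fit(X, y)
print(clf.score(X, y))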

Answer 1 (score: 1):

There are several different ways to split your set into training, test, and cross-validation sets. Check out sklearn.model_selection.train_test_split (sklearn.cross_validation in older releases). But also see the plethora of advanced splitting methods available in SK-Learn; a short sketch of one of them follows after the example below.

Here is an example of train_test_split:

In:
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
a, b = np.arange(10).reshape((5, 2)), range(5)
a

Out:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In: 
list(b)


Out:
[0, 1, 2, 3, 4]

In:
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.33, random_state=42)
a_train

Out:
array([[4, 5],
       [0, 1],
       [6, 7]])

In:
b_train

Out:
[2, 0, 3]

In:
a_test

Out:
array([[2, 3],
       [8, 9]])

In:
b_test

Out:
[1, 4]
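As one illustration of the advanced splitting methods mentioned above (my addition, not from the original answer), StratifiedKFold yields train/test index splits that preserve the class proportions of y in every fold:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape((10, 2))
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # each fold keeps the 50/50 class balance of y
    print('train:', train_idx, 'test:', test_idx)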