有没有办法自动选择培训样本'从功能集合中获得更好的模型拟合(DT或SVM)?我知道选择'功能'。但我在谈论选择样本'选择功能后。
答案 0 :(得分:1)
通常有两种方法可以选择功能:Univariate Feature Selection
和L1-based Sparse Feature Selection
。
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
import numpy as np
# simulate some artificial data: 2000 obs, features: 1000-dim
# but only 2 out 1000 features are informative, the rest 998 features are noises
X, y = make_classification(n_samples=2000, n_features=1000, n_informative=2, random_state=0)
X.shape
Out[153]: (2000, 1000)
# Univariate Feature Selection: select 20 best from 1000 features
# ==========================================================================
# classification F-test
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_selected.shape
# or to visualize each f-score/p-value of 1000 features
X_f_scores, X_f_pval = f_classif(X, y)
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(X_f_scores)
ax.set_title('Univariate Feature Selection: Classification F-Score')
ax.set_xlabel('features')
ax.set_ylabel('F-score')
# which features are most important: top 10
np.argsort(X_f_scores)[-10:] # argsort is from smallest to largest
Out[154]: array([940, 163, 574, 969, 994, 977, 360, 291, 838, 524])
# L1-based Sparse Feature Selection: any algo implementation penalty 'l1'
# ==========================================================================
# use LinearSVC for example here
# other popular choices: logistic regression, Lasso (for regression)
feature_selector = LinearSVC(C=0.01, penalty='l1', dual=False)
feature_selector.fit(X, y)
# get features with non-zero coefficients: exactly 2
(feature_selector.coef_ != 0.0).sum()
Out[155]: 2
X_selected_l1 = feature_selector.transform(X)
# or X[:, feature_selector.coef_ != 0.0]
答案 1 :(得分:1)
有几种不同的方法可以将您的集合拆分为培训,测试和交叉验证集。查看sklearn.cross_validation.train_test_split
。但另请参阅SK-Learn中也提供的plethora of advanced splitting methods。
以下是test_train_split
的示例:
In:
import numpy as np
from sklearn.cross_validation import train_test_split
a, b = np.arange(10).reshape((5, 2)), range(5)
a
Out:
array([[0, 1],
[2, 3],
[4, 5],
[6, 7],
[8, 9]])
In:
list(b)
Out:
[0, 1, 2, 3, 4]
In:
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.33, random_state=42)
a_train
Out:
array([[4, 5],
[0, 1],
[6, 7]])
In:
b_train
Out:
[2, 0, 3]
In:
a_test
Out:
array([[2, 3],
[8, 9]])
In:
b_test
Out:
[1, 4]