I am trying to generate feature importance plots for a random forest using cross-validation folds. When using only the feature (X) and target (y) data, the implementation is straightforward, for example:
rfc = RandomForestClassifier()
rfc.fit(X, y)
importances = pd.DataFrame({'FEATURE':data_x.columns,'IMPORTANCE':np.round(rfc.feature_importances_,3)})
importances = importances.sort_values('IMPORTANCE',ascending=False).set_index('FEATURE')
print(importances)
importances.plot.bar()
plt.show()
However, how can I adapt this code to create a similar plot for each cross-validation fold (k-fold) that will be created?
The code I currently have is:
# Empty list storage to collect all results for displaying as plots
mylist = []
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
kf = KFold(n_splits=3)
for train, test in kf.split(X, y):
    train_data = np.array(X)[train]
    test_data = np.array(y)[test]
    for rfc = RandomForestClassifier():
        rfc.fit(train_data, test_data)
For example, the code above uses the cross-validation technique to create 3 folds, and my goal is to create a feature importance plot for each of those 3 folds, resulting in 3 plots. At the moment, this gives me a loop error.
I am not sure of the most efficient way to pass each created fold through the random forest separately and generate a feature importance plot per fold.
Answer 0 (score: 2)
One cause of the error is this line: rfc.fit(train_data, test_data). You should pass the training labels as the second argument, not the test data.
As for plotting, you can try the code below. I assume you are aware that in this case the k-fold CV is used only to select different training sets; the test data is ignored, since no predictions are made:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.datasets import make_classification
# dummy classification dataset
X, y = make_classification(n_features=10)
# dummy feature names
feature_names = ['F{}'.format(i) for i in range(X.shape[1])]
kf = KFold(n_splits=3)
rfc = RandomForestClassifier()
count = 1
# test data is not needed for fitting
for train, _ in kf.split(X, y):
    rfc.fit(X[train, :], y[train])
    # sort the feature index by importance score in descending order
    importances_index_desc = np.argsort(rfc.feature_importances_)[::-1]
    feature_labels = [feature_names[i] for i in importances_index_desc]
    # plot
    plt.figure()
    plt.bar(feature_labels, rfc.feature_importances_[importances_index_desc])
    plt.xticks(feature_labels, rotation='vertical')
    plt.ylabel('Importance')
    plt.xlabel('Features')
    plt.title('Fold {}'.format(count))
    count = count + 1
plt.show()
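A variant of the answer above (my own sketch, not part of the original answer) collects each fold's importance vector and plots a single summary chart: the mean importance per feature across folds, with the standard deviation as an error bar. This can be easier to read than k separate plots when k is large:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.datasets import make_classification

# dummy classification dataset, as in the answer above
X, y = make_classification(n_features=10, random_state=0)
feature_names = ['F{}'.format(i) for i in range(X.shape[1])]

kf = KFold(n_splits=3)
rfc = RandomForestClassifier(random_state=0)

# collect one importance vector per fold
fold_importances = []
for train, _ in kf.split(X, y):
    rfc.fit(X[train, :], y[train])
    fold_importances.append(rfc.feature_importances_)

fold_importances = np.array(fold_importances)  # shape: (n_folds, n_features)
mean_imp = fold_importances.mean(axis=0)
std_imp = fold_importances.std(axis=0)

# sort features by mean importance, descending
order = np.argsort(mean_imp)[::-1]
plt.figure()
plt.bar([feature_names[i] for i in order], mean_imp[order], yerr=std_imp[order])
plt.ylabel('Mean importance across folds')
plt.xlabel('Features')
plt.show()
```

Since each fold's `feature_importances_` vector sums to 1, the mean vector sums to 1 as well, so the bars remain directly comparable across runs.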
Answer 1 (score: 2)
Here is the code that worked for me:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.datasets import make_classification
# classification dataset
data_x, data_y = make_classification(n_features=9)
# make_classification returns a NumPy array (no .columns), so build the feature names directly
feature_names = ['F{}'.format(i) for i in range(data_x.shape[1])]
kf = KFold(n_splits=10)
rfc = RandomForestClassifier()
count = 1
# test data is not needed for fitting
for train, _ in kf.split(data_x, data_y):
    rfc.fit(data_x[train, :], data_y[train])
    # sort the feature index by importance score in descending order
    importances_index_desc = np.argsort(rfc.feature_importances_)[::-1]
    feature_labels = [feature_names[i] for i in importances_index_desc]
    # plot
    plt.figure()
    plt.bar(feature_labels, rfc.feature_importances_[importances_index_desc])
    plt.xticks(feature_labels, rotation='vertical')
    plt.ylabel('Importance')
    plt.xlabel('Features')
    plt.title('Fold {}'.format(count))
    count = count + 1
plt.show()