绘制重要性变量xgboost Python

时间:2018-08-18 05:22:16

标签: python matplotlib machine-learning xgboost feature-selection

当我绘制功能重要性图时,会出现混乱的图。我有7000多个变量。我了解内置功能只会选择最重要的功能,尽管最终图形不可读。 这是完整的代码:

import numpy as np
import pandas as pd
df = pd.read_csv('ricerice.csv')
array=df.values
X = array[:,0:7803]
Y = array[:,7804]
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
seed=0
test_size=0.30
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=test_size, random_state=seed)
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X, Y)
import matplotlib.pyplot as plt
from matplotlib import pyplot
from xgboost import plot_importance
fig1=plt.gcf()
plot_importance(model)
plt.draw()
fig1.savefig('xgboost.png', figsize=(50, 40), dpi=1000)

尽管图形大小大,但图形难以辨认。 xgboost feature importance plot

2 个答案:

答案 0 :(得分:5)

有两点:

  1. 要拟合模型,您要使用训练数据集(X_train, y_train),而不是整个数据集(X, y)。
  2. 您可以使用max_num_features函数的plot_importance()参数来仅显示前max_num_features个功能(例如前10个)。

在对代码进行上述修改后,使用一些随机生成的数据,代码和输出如下:

import numpy as np

# generate some random data for demonstration purpose, use your original dataset here
X = np.random.rand(1000,100)     # 1000 x 100 data
y = np.random.rand(1000).round() # 0, 1 labels

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
seed=0
test_size=0.30
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=test_size, random_state=seed)
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)
import matplotlib.pylab as plt
from matplotlib import pyplot
from xgboost import plot_importance
plot_importance(model, max_num_features=10) # top 10 most important features
plt.show()

enter image description here

答案 1 :(得分:0)

您可以通过具有feature_importances_属性的Xgboost模型获得功能重要性。就您而言,它将是:

model.feature_imortances_

此属性是对每个功能具有gain重要性的数组。然后可以绘制它:

from matplotlib import pyplot as plt
plt.barh(feature_names, model.feature_importances_)

({feature_names是具有功能名称的列表)

您可以对数组进行排序并选择所需的功能数(例如10):

sorted_idx = model.feature_importances_.argsort()
plt.barh(feature_names[sorted_idx][:10], model.feature_importances_[sorted_idx][:10])
plt.xlabel("Xgboost Feature Importance")

还有两种获取功能重要性的方法:

  • 您可以使用permutation_importance中的scikit-learn(从0.22版开始)
  • 您可以使用SHAP值

您可以在我的blog post中阅读更多内容。