xgboost.plot_tree: interpreting binary features

Date: 2018-09-13 13:06:06

Tags: python machine-learning xgboost

I've built an XGBoost model and am trying to examine the individual estimators. For reference, this is a binary classification task with both discrete and continuous input features. The input feature matrix is a scipy.sparse.csr_matrix.

However, when I go to examine an individual estimator, I find it hard to interpret the binary input features, such as f60150 below. The real-valued f60150 in the bottom-most chart is easy to interpret: its split criterion falls within the expected range of that feature. But the comparisons made on the binary features, of the form <X> < -9.53674e-07, don't make sense. Each of these features is either 1 or 0. -9.53674e-07 is a very, very small negative number, and I imagine this is just some floating-point idiosyncrasy within XGBoost or its underlying plotting library, but it makes no sense to use that comparison when the feature is always non-negative. Can someone help me understand which direction (i.e. yes, missing vs. no) corresponds to the true/false side of these binary feature nodes?

Here is a reproducible example:

import numpy as np
import scipy.sparse
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import plot_tree, XGBClassifier
import matplotlib.pyplot as plt

def booleanize_csr_matrix(mat):
    ''' Convert sparse matrix with positive integer elements to 1s '''
    nnz_inds = mat.nonzero()
    keep = np.where(mat.data > 0)[0]
    n_keep = len(keep)
    result = scipy.sparse.csr_matrix(
        (np.ones(n_keep), (nnz_inds[0][keep], nnz_inds[1][keep])),
        shape=mat.shape
    )
    return result

### Setup dataset
res = fetch_20newsgroups()

text = res.data
outcome = res.target

### Use default params from CountVectorizer to create initial count matrix
vec = CountVectorizer()
X = vec.fit_transform(text)

# Whether to "booleanize" the input matrix
booleanize = True

# Whether to, after "booleanizing", convert the data type to match what's returned by `vec.fit_transform(text)`
to_int = True

if booleanize and to_int:
    X = booleanize_csr_matrix(X)
    X = X.astype(np.int64)

# Make it a binary classification problem
y = np.where(outcome == 1, 1, 0)

# Random state ensures we will be able to compare trees and their features consistently
model = XGBClassifier(random_state=100)
model.fit(X, y)

plot_tree(model, rankdir='LR'); plt.show()

Running the above with booleanize and to_int set to True produces the following chart:

[image: plotted trees with booleanize and to_int set to True]
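As a quick aside, a sanity check run right after the booleanize/to_int = True fit confirms that the matrix really contains only zeros and ones, so the -9.53674e-07 threshold can't be coming from stray fractional counts (just a sketch, using the X from the script above):

# Sanity check: the booleanized matrix should contain nothing but 0s and 1s
print(np.unique(X.data))   # stored (non-zero) entries -- expected: [1]
print(X.min(), X.max())    # including implicit zeros -- expected: 0 1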

Running the above with booleanize and to_int set to False produces the following chart:

[image: plotted trees with booleanize and to_int set to False]
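For reference, here is how I have been cross-checking the plots against a text dump of the same booster (a small sketch, assuming get_booster()/get_dump() behave as in current xgboost releases). The dump makes the yes=/no=/missing= child ids explicit, but it uses the same < comparisons, so it still doesn't tell me which side is "true" for a 0/1 feature:

# Text dump of the first tree from the fitted classifier
booster = model.get_booster()                  # underlying Booster object
print(booster.get_dump(with_stats=False)[0])   # e.g. "0:[f60150<-9.53674e-07] yes=1,no=2,missing=1"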

Heck, even when I do a really simple example, I get the "right" results regardless of whether X or y are integer or floating-point types.

X = np.matrix(
    [
        [1,0],
        [1,0],
        [0,1],
        [0,1],
        [1,1],
        [1,0],
        [0,0],
        [0,0],
        [1,1],
        [0,1]
    ]
)

y = np.array([1,0,0,0,1,1,1,0,1,1])

model = XGBClassifier(random_state=100)
model.fit(X, y)

plot_tree(model, rankdir='LR'); plt.show()

[image: plotted tree for the simple example]
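As a further aside, if the installed xgboost exposes Booster.trees_to_dataframe() (newer releases do; treat this as a sketch under that assumption), the same trees can be read as a table, which at least lists the Yes/No/Missing child ids for every split explicitly:

# Tabular view of the fitted trees (one row per node, requires pandas)
booster = model.get_booster()
tree_df = booster.trees_to_dataframe()
print(tree_df[['Tree', 'Node', 'Feature', 'Split', 'Yes', 'No', 'Missing']])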

0 Answers:

No answers yet.