Decision tree question: why doesn't tree() select all variables for the nodes?

Asked: 2018-06-06 01:24:04

Tags: r decision-tree


I am working on a project at my workplace and I have run into some problems with a decision tree analysis. This is not homework. Sample dataset:

PRODUCT_SUB_LINE_DESCR    MAJOR_CATEGORY_DESCR       CUST_REGION_DESCR
SUNDRY                         SMALL EQUIP           NORTH EAST REGION
SUNDRY                         SMALL EQUIP           SOUTH EAST REGION
SUNDRY                         SMALL EQUIP           SOUTH EAST REGION
SUNDRY                         SMALL EQUIP           NORTH EAST REGION
SUNDRY                         PREVENTIVE            SOUTH CENTRAL REGION
SUNDRY                         PREVENTIVE            SOUTH EAST REGION
SUNDRY                         PREVENTIVE            SOUTH EAST REGION
SUNDRY                         SMALL EQUIP           NORTH CENTRAL REGION
SUNDRY                         SMALL EQUIP           MOUNTAIN WEST REGION
SUNDRY                         SMALL EQUIP           MOUNTAIN WEST REGION
SUNDRY                         COMPOSITE             NORTH CENTRAL REGION
SUNDRY                         COMPOSITE             NORTH CENTRAL REGION
SUNDRY                         COMPOSITE             OHIO VALLEY REGION
SUNDRY                         COMPOSITE             NORTH EAST REGION

Sales   QtySold      MFGCOST    MarginDollars   new_ProductName
209.97  3             134.55    72.72            no
-76.15  -1            -44.85    -30.4            no
275.6   2             162.5     109.84           no
138.7   1             81.25     55.82            no
226     2             136       87.28            no
115     1             68        45.64            no
210.7   2             136       71.98            no
29      1             18.85     9.77             no
29      1             18.85     9.77             no
46.32   2             37.7      7.86             no
159.86  1             132.4     24.81            no
441.3   2             264.8     171.2            no
209.62  1             132.4     74.57            no
209.62  1             132.4     74.57            no

1) Why does my tree have only two terminal nodes?

>summary(tree_model)
Classification tree:
tree(formula = new_ProductName ~ ., data = training_data)
Variables actually used in tree construction:
[1] "PRODUCT_SUB_LINE_DESCR"
Number of terminal nodes:  2 
Residual mean deviance:  0 = 0 / 41140 
Misclassification error rate: 0 = 0 / 41146 

2) I did create a new data frame that keeps only the factors with fewer than 22 levels. One factor still has 25 levels, but tree() did not throw an error, so I assume the algorithm accepts 25 levels.

>str(new_Dataset)
'data.frame':   51433 obs. of  7 variables:
 $ PRODUCT_SUB_LINE_DESCR: Factor w/ 3 levels "Handpieces","PRIVATE 
                             LABEL",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ MAJOR_CATEGORY_DESCR  : Factor w/ 25 levels "AIR ABRASION",..: 23 23 23 
                             23 21 21 21 23 23 23 ...
 $ CUST_REGION_DESCR     : Factor w/ 7 levels "MOUNTAIN WEST REGION",..: 3 
                             6 6 3 5 6 6 2 1 1 ...
 $ Sales                 : num  210 -76.2 275.6 138.7 226 ...
 $ QtySold               : int  3 -1 2 1 2 1 2 1 1 2 ...
 $ MFGCOST               : num  134.6 -44.9 162.5 81.2 136 ...
 $ MarginDollars         : num  72.7 -30.4 109.8 55.8 87.3 ...
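
For what it's worth, tree() documents a limit of 32 levels for unordered factor predictors, which would explain why the 25-level MAJOR_CATEGORY_DESCR went through without an error. A small base-R helper to check factor cardinality before fitting (a sketch; count_factor_levels is my own name, not part of any package):

```r
# Report the number of levels of every factor column in a data frame;
# tree() only rejects unordered factors with more than 32 levels.
count_factor_levels <- function(df) {
  sapply(Filter(is.factor, df), nlevels)
}
# count_factor_levels(new_Dataset)  # e.g. PRODUCT_SUB_LINE_DESCR: 3, ...
```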

3) Here is how I set up the analysis:

 # load the package that provides tree(), cv.tree() and prune.misclass()
 library(tree)

 # I chose product name as my main attribute (maybe that is why it appears
 # at the root node?)
 new_ProductName = ifelse(new_Dataset$PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL",
                          "yes", "no")
 data = data.frame(new_Dataset, new_ProductName)
 set.seed(100)
 train = sample(1:nrow(data), 0.8 * nrow(data)) # training row indices
 training_data = data[train, ]  # training data
 testing_data = data[-train, ]  # testing data

 # fit the tree model using the training data
 tree_model = tree(new_ProductName ~ ., data = training_data)
 summary(tree_model)
 plot(tree_model)
 text(tree_model, pretty = 0)
 out = predict(tree_model) # class probabilities on the training data
 # actuals
 input.newproduct = as.character(training_data$new_ProductName)
 # predicted: the column (class) with the highest probability in each row
 pred.newproduct = colnames(out)[max.col(out, ties.method = "first")]
 mean(input.newproduct != pred.newproduct) # misclassification rate

# Cross-validation to see how much we need to prune the tree
set.seed(400)
cv_Tree = cv.tree(tree_model, FUN = prune.misclass) # run cross-validation
plot(cv_Tree) # plot the CV results
plot(cv_Tree$size, cv_Tree$dev, type = "b") # no attach() needed
# set 'best' to the size with the lowest deviance in the plot above
treePruneMod = prune.misclass(tree_model, best = 9)
plot(treePruneMod)
text(treePruneMod, pretty = 0)
out = predict(treePruneMod) # class probabilities from the pruned tree
# predicted
pred.newproduct = colnames(out)[max.col(out, ties.method = "random")]
# calculate the misclassification error
mean(training_data$new_ProductName != pred.newproduct)
# predict the test data with the pruned tree
out = predict(treePruneMod, testing_data, type = "class")
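
One step that seems to be missing after the last line is actually scoring those held-out predictions. A minimal sketch (misclass_rate is a hypothetical helper name, not from the tree package; the commented lines assume the objects defined above):

```r
# Misclassification rate from predicted and actual label vectors
misclass_rate <- function(pred, actual) {
  mean(as.character(pred) != as.character(actual))
}
# With the objects defined above:
# pred_test <- predict(treePruneMod, testing_data, type = "class")
# table(predicted = pred_test, actual = testing_data$new_ProductName)
# misclass_rate(pred_test, testing_data$new_ProductName)
```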

4) I have never done this before; I watched a few YouTube videos and started from there. I would welcome good advice, explanations, and criticism. Please help me through the process. It has been a challenge for me.

> table(data$PRODUCT_SUB_LINE_DESCR, data$new_ProductName)

                  no      yes
  Handpieces      164     0
  PRIVATE LABEL   0       14802
  SUNDRY          36467    0

1 Answer:

Answer 1 (score: 0):

In short, trees work by finding, at each node, the variable that gives the best* split (i.e., the one that best separates the two classes), and terminating a branch once it is pure.

For your problem, the algorithm judged PRODUCT_SUB_LINE_DESCR to be the best variable to split on, and that split produced pure branches on both sides, so no further splits were needed.

This comes from how you defined your classes, and your intuition was right:

# I chose product name as my main attribute (maybe that is why it appears at
# the root node?)
new_ProductName = ifelse(PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL", "yes", "no")

With the code/rule above, you define the class and the best split at the same time. At that point tree-based classification is equivalent to simple rule-based classification. Not a good idea.

You should first think about what you want to achieve. If you want to predict the product name from the other attributes, then drop the PRODUCT_SUB_LINE_DESCR column from the data frame after creating the class (i.e., new_ProductName), and run the tree classification again.
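
Concretely, that fix can be sketched as follows (drop_leak is a hypothetical helper name; the commented lines assume the question's data, train, and the tree package):

```r
# Remove the predictor the label was derived from, so the tree cannot
# simply read the class back off that single column.
drop_leak <- function(df, leak_col) {
  df[, setdiff(names(df), leak_col), drop = FALSE]
}
# data2 <- drop_leak(data, "PRODUCT_SUB_LINE_DESCR")
# tree_model2 <- tree(new_ProductName ~ ., data = data2[train, ])
# summary(tree_model2)  # 'Variables actually used' will no longer be the leak
```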

*Note: the best split is chosen based on information gain or the Gini index.
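
To make the footnote concrete: for a node with class proportions p, the Gini impurity is 1 - sum(p^2); it is 0 for a pure node, which is exactly why the branches above terminated after one split. A toy base-R illustration:

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
gini(c("yes", "yes", "no", "no"))  # 0.5, the worst case for two classes
gini(c("yes", "yes", "yes"))       # 0, a pure node
```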