I am working on a project at my workplace and have run into some problems with decision tree analysis. This is not homework.
Sample dataset:
PRODUCT_SUB_LINE_DESCR  MAJOR_CATEGORY_DESCR  CUST_REGION_DESCR
SUNDRY                  SMALL EQUIP           NORTH EAST REGION
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION
SUNDRY                  SMALL EQUIP           NORTH EAST REGION
SUNDRY                  PREVENTIVE            SOUTH CENTRAL REGION
SUNDRY                  PREVENTIVE            SOUTH EAST REGION
SUNDRY                  PREVENTIVE            SOUTH EAST REGION
SUNDRY                  SMALL EQUIP           NORTH CENTRAL REGION
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION
SUNDRY                  COMPOSITE             OHIO VALLEY REGION
SUNDRY                  COMPOSITE             NORTH EAST REGION
Sales QtySold MFGCOST MarginDollars new_ProductName
209.97 3 134.55 72.72 no
-76.15 -1 -44.85 -30.4 no
275.6 2 162.5 109.84 no
138.7 1 81.25 55.82 no
226 2 136 87.28 no
115 1 68 45.64 no
210.7 2 136 71.98 no
29 1 18.85 9.77 no
29 1 18.85 9.77 no
46.32 2 37.7 7.86 no
159.86 1 132.4 24.81 no
441.3 2 264.8 171.2 no
209.62 1 132.4 74.57 no
209.62 1 132.4 74.57 no
1) My tree has only two nodes; here is the summary output:
> summary(tree_model)
Classification tree:
tree(formula = new_ProductName ~ ., data = training_data)
Variables actually used in tree construction:
[1] "PRODUCT_SUB_LINE_DESCR"
Number of terminal nodes: 2
Residual mean deviance: 0 = 0 / 41140
Misclassification error rate: 0 = 0 / 41146
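2) I did create a new data frame that contains only factors with fewer than 22 levels. One factor has 25 levels, but tree() did not raise an error, so I assume the algorithm accepts 25 levels.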
> str(new_Dataset)
'data.frame': 51433 obs. of 7 variables:
$ PRODUCT_SUB_LINE_DESCR: Factor w/ 3 levels "Handpieces","PRIVATE LABEL",..: 3 3 3 3 3 3 3 3 3 3 ...
$ MAJOR_CATEGORY_DESCR : Factor w/ 25 levels "AIR ABRASION",..: 23 23 23 23 21 21 21 23 23 23 ...
$ CUST_REGION_DESCR : Factor w/ 7 levels "MOUNTAIN WEST REGION",..: 3 6 6 3 5 6 6 2 1 1 ...
$ Sales : num 210 -76.2 275.6 138.7 226 ...
$ QtySold : int 3 -1 2 1 2 1 2 1 1 2 ...
$ MFGCOST : num 134.6 -44.9 162.5 81.2 136 ...
$ MarginDollars : num 72.7 -30.4 109.8 55.8 87.3 ...
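The level counts per factor can also be checked directly rather than read off str(). The following is only a minimal sketch: the data frame `toy` and its columns are made up and stand in for new_Dataset. The threshold of 32 is taken from the tree package documentation, which states that factor predictors can have up to 32 levels, so a 25-level factor is within the limit.

```r
# Hypothetical stand-in for new_Dataset: two factors and one numeric column.
toy <- data.frame(
  category = factor(sprintf("CAT_%02d", rep(1:25, length.out = 100))),
  region   = factor(rep(c("NORTH", "SOUTH", "EAST", "WEST"), length.out = 100)),
  sales    = seq(10, 1000, length.out = 100)
)

# Number of levels for each factor column (non-factor columns are skipped).
is_fac   <- vapply(toy, is.factor, logical(1))
n_levels <- vapply(toy[is_fac], nlevels, integer(1))
print(n_levels)

# Names of factors that would exceed the 32-level limit documented for tree().
too_many <- names(n_levels)[n_levels > 32]
print(too_many)
```

On real data, replacing `toy` with new_Dataset should give the same kind of summary.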
3) Here is how I set up the analysis:
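# I chose product name as my main attribute (maybe that is why it appears at
# the root node?)
new_ProductName = ifelse(new_Dataset$PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL", "yes", "no")
# factor() so that tree() fits a classification tree even when strings are not auto-converted
data = data.frame(new_Dataset, new_ProductName = factor(new_ProductName))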
set.seed(100)
train = sample(1:nrow(data), 0.8*nrow(data)) # training row indices
training_data = data[train,] # training data
testing_data = data[-train,] # testing data
# fit the tree model using training data
library(tree)
tree_model = tree(new_ProductName ~ ., data = training_data)
summary(tree_model)
plot(tree_model)
text(tree_model, pretty = 0)
out = predict(tree_model) # predict the training data
# actuals
input.newproduct = as.character(training_data$new_ProductName)
# predicted
pred.newproduct = colnames(out)[max.col(out,ties.method = c("first"))]
mean (input.newproduct != pred.newproduct) # misclassification %
# Cross-validation to see how much we need to prune the tree
set.seed(400)
cv_Tree = cv.tree(tree_model, FUN = prune.misclass) # run cross-validation
plot(cv_Tree) # plot the CV
plot(cv_Tree$size, cv_Tree$dev, type = "b")
# set `best` to the size corresponding to the lowest value in the plot above.
treePruneMod = prune.misclass(tree_model, best = 9)
plot(treePruneMod)
text(treePruneMod, pretty = 0)
out = predict(treePruneMod) # predict the training data with the pruned tree
# Predicted
pred.newproduct = colnames(out)[max.col(out,ties.method = c("random"))]
# calculate Mis-classification error
mean(training_data$new_ProductName != pred.newproduct)
# Predict testData with Pruned tree
out = predict(treePruneMod, testing_data, type = "class")
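4) I have never done this before. I watched a few YouTube videos and started from there. I welcome good suggestions, explanations, and criticism; please help me through this process. It has been challenging for me.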
> table(data$PRODUCT_SUB_LINE_DESCR, data$new_ProductName)
no yes
Handpieces 164 0
PRIVATE LABEL 0 14802
SUNDRY 36467 0
Answer 0 (score: 0)
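In short, trees work by finding, at each node, the variable that gives the best* split (i.e., the one that best separates the two classes), and a branch terminates once it is pure.
For your problem, the algorithm judged "PRODUCT_SUB_LINE_DESCR" to be the best variable to split on, and that split produces a pure branch on each side, so no further splitting is needed.
This follows from how you defined your class, and your intuition is correct: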
# I choose product name as my main attribute (maybe that is why it appears at
# the root node?)
new_ProductName = ifelse( PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL","yes","no")
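With the code/rule above, you define the class and the best split at the same time. At that point, tree-based classification is equivalent to classification by a simple rule. Not a good idea.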
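One way to see this before fitting anything is to cross-tabulate the predictor against the class, as in the table() output in the question. A rough sketch, with the counts copied from that output; the names `counts`, `pure_level`, and `leaks_label` are just illustrative:

```r
# Counts copied from the table() output in the question.
counts <- matrix(
  c(  164,     0,
        0, 14802,
    36467,     0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Handpieces", "PRIVATE LABEL", "SUNDRY"), c("no", "yes"))
)

# A level is "pure" when all of its rows fall into a single class.
pure_level <- rowSums(counts > 0) == 1
print(pure_level)

# If every level is pure, the predictor determines the class completely.
leaks_label <- all(pure_level)
print(leaks_label)
```

When every level of a predictor maps to exactly one class like this, a single split on that predictor is enough and the tree has nothing left to learn.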
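You should first think about what you want to achieve. If you want to predict the product name from the other attributes, then create the class (i.e., "new_ProductName") in the data frame, remove the "PRODUCT_SUB_LINE_DESCR" column afterwards, and only then run the tree classification.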
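Roughly, that could look like the sketch below. The data here are made up (`toy`, `sub_line`, and the Sales values are hypothetical stand-ins for your data frame), and the fit only runs if the tree package is installed:

```r
# Hypothetical toy data standing in for new_Dataset (all values are made up).
set.seed(1)
n <- 300
sub_line <- sample(c("Handpieces", "PRIVATE LABEL", "SUNDRY"), n, replace = TRUE)
toy <- data.frame(
  PRODUCT_SUB_LINE_DESCR = factor(sub_line),
  CUST_REGION_DESCR = factor(sample(c("NORTH EAST REGION", "SOUTH EAST REGION"),
                                    n, replace = TRUE)),
  Sales = ifelse(sub_line == "PRIVATE LABEL", runif(n, 10, 200), runif(n, 150, 500))
)

# Create the class first, as a factor so that a classification tree is fitted.
toy$new_ProductName <- factor(
  ifelse(toy$PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL", "yes", "no"),
  levels = c("no", "yes")
)

# Then drop the column the class was derived from.
model_data <- toy[, setdiff(names(toy), "PRODUCT_SUB_LINE_DESCR")]
print(names(model_data))

# Fit only if the tree package is available; the remaining predictors
# now have to carry the classification.
if (requireNamespace("tree", quietly = TRUE)) {
  fit <- tree::tree(new_ProductName ~ ., data = model_data)
  print(fit)
}
```

With the source column gone, the tree can no longer reproduce the rule directly, so its error rate says something about the other attributes.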
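*Note: the best split is chosen by a criterion such as information gain or the Gini index. In the tree package, tree() uses deviance by default, with split = "gini" as the alternative.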
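To make the note concrete, here is a small sketch that computes both criteria by hand for the split "PRIVATE LABEL" versus the rest, using the counts from the table() output in the question. The helper functions `gini`, `entropy`, and `weighted` are written for this illustration and are not part of the tree package:

```r
# Counts copied from the table() output in the question.
counts <- matrix(c(164, 0, 0, 14802, 36467, 0), ncol = 2, byrow = TRUE,
                 dimnames = list(c("Handpieces", "PRIVATE LABEL", "SUNDRY"),
                                 c("no", "yes")))

# Impurity measures for a vector of class counts.
gini    <- function(n) { p <- n / sum(n); 1 - sum(p^2) }
entropy <- function(n) { p <- n[n > 0] / sum(n); -sum(p * log2(p)) }

# Class counts at the root, before any split.
root <- colSums(counts)

# Candidate split: "PRIVATE LABEL" on one side, the other two levels on the other.
left  <- counts["PRIVATE LABEL", ]
right <- colSums(counts[c("Handpieces", "SUNDRY"), ])

# Size-weighted impurity of the two child nodes.
weighted <- function(f) {
  (sum(left) * f(left) + sum(right) * f(right)) / sum(root)
}

gini_decrease <- gini(root) - weighted(gini)
info_gain     <- entropy(root) - weighted(entropy)
print(c(gini_decrease = gini_decrease, info_gain = info_gain))
```

Both child nodes contain a single class, so their weighted impurity is zero under either measure and the split removes all of the impurity present at the root.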