Question

I am testing a program that builds decision trees in R, and decided to use the car dataset provided by UCI here.

According to the authors, it has 7 attributes:

CAR                      car acceptability
   . PRICE                  overall price
   . . buying               buying price
   . . maint                price of the maintenance
   . TECH                   technical characteristics
   . . COMFORT              comfort
   . . . doors              number of doors
   . . . persons            capacity in terms of persons to carry
   . . . lug_boot           the size of luggage boot
   . . safety               estimated safety of the car

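So I want to use a DT as a classifier to predict car acceptability from the buying price, maintenance, comfort, doors, persons, luggage boot, and safety.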

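First, I took out the first column as the dependent variable. Then I noticed that the data is ordered by the value of the first column (vhigh, high, med, low), so I decided to shuffle the data. My code is as follows: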

car_data<-read.csv("car.data")
library(C50)
set.seed(12345)
car_data_rand<-car_data[order(runif(1727)),]
car_data<-car_data_rand
car_data_train<-car_data[1:1500,]
car_data_test<-car_data[1501:1727,]
answer<-data_train$vhigh
answer_test<-data_test$vhigh
#deleting the dependent variable or y from the data
car_data_train$vhigh<-NULL
car_data_test$vhigh<-NULL
car_model<-C5.0(car_data_train,answer)
summary(car_model)

I am getting a huge error rate here:

Evaluation on training data (1500 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         7  967(64.5%)   <<

What am I doing wrong?

Answer 1

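In the middle of your code you have data_train and data_test instead of car_data_train and car_data_test.
Apart from that, although the error rate is high, nothing is actually wrong. Note that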

1 - table(answer) / length(answer)
# answer
#      high       low       med     vhigh 
# 0.7466667 0.7566667 0.7426667 0.7540000

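This means that if you naively always guessed "low", your error would be 75.6%, so the tree improves on that by ~11.1%. (Always guessing the most frequent class, "med", would give 74.3%.) The fact that the improvement is small means that the predictors are not great.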
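The comparison against a constant guess can be sketched in base R as follows. The class counts are not read from the file; they are reconstructed from the proportions printed above for the 1500 training rows, and the names constant_guess_error, baseline_error and tree_error are only illustrative.

```r
# Class counts reconstructed from the proportions above (1500 training rows);
# this is an assumption standing in for the real `answer` vector.
answer <- factor(rep(c("high", "low", "med", "vhigh"),
                     times = c(380, 365, 386, 369)))

# Error rate of a classifier that always predicts one fixed class
constant_guess_error <- 1 - table(answer) / length(answer)

# The best constant guess is the most frequent class (lowest error)
baseline_error <- min(constant_guess_error)
best_guess <- names(which.min(constant_guess_error))

# Training error reported by summary(car_model) in the question
tree_error <- 967 / 1500

# How much the tree gains over the best constant guess
improvement <- baseline_error - tree_error
print(round(c(baseline = baseline_error, tree = tree_error,
              improvement = improvement), 3))
```

Comparing against the most frequent class gives the strictest constant baseline, so the gain is slightly smaller than when comparing against "low".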

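Finally, there is an inconsistency: you say you want to model car acceptability, while your code models the buying variable. Fixing that results in an error of only 1.1%. However, in this case your sample is quite unbalanced: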

1 - table(answer) / length(answer)
# answer
#       acc      good     unacc     vgood 
# 0.7773333 0.9600000 0.3020000 0.9606667

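That is, by always guessing unacc you would already get an error of only 30.2%. Still, the improvement of 29.1 percentage points (from 30.2% down to 1.1%) is clearly larger than before.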
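A minimal sketch of the corrected workflow might look like the following. It assumes that car.data has no header row, which the column name vhigh and the 1727 rows in the question suggest. The column names follow the attribute list in the question, with "class" as an assumed name for the acceptability column, and fit_car_tree is a hypothetical helper rather than part of C50. Since the shuffle and the row count differ from the question, the resulting error need not match the 1.1% quoted above.

```r
# Hypothetical helper: fit a C5.0 tree on car acceptability and report errors.
# Requires the C50 package and a local copy of car.data when called.
fit_car_tree <- function(path = "car.data", n_train = 1500, seed = 12345) {
  # Assumption: the file has no header row, so name the columns explicitly
  col_names <- c("buying", "maint", "doors", "persons",
                 "lug_boot", "safety", "class")
  car_data <- read.csv(path, header = FALSE, col.names = col_names,
                       stringsAsFactors = TRUE)

  # Shuffle the rows, because the file is sorted by the first columns
  set.seed(seed)
  car_data <- car_data[sample(nrow(car_data)), ]

  # Split into training and test sets
  train_rows <- seq_len(n_train)
  predictors <- setdiff(col_names, "class")
  train <- car_data[train_rows, ]
  test <- car_data[-train_rows, ]

  # Fit on the acceptability column instead of the buying price
  model <- C50::C5.0(x = train[, predictors], y = train$class)
  predicted <- predict(model, newdata = test[, predictors])

  list(
    model = model,
    # Error of always guessing the most frequent training class
    baseline_error = 1 - max(table(train$class)) / nrow(train),
    # Error of the tree on the held-out rows
    test_error = mean(as.character(predicted) != as.character(test$class))
  )
}
```

Calling result <- fit_car_tree() and then summary(result$model) prints the same kind of training summary as in the question, while result$test_error shows how the tree does on rows it has not seen.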
