Question

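I am trying to build a decision tree with the C4.5 algorithm for a school project. The tree is for Haberman's Survival Data Set, whose attribute information is given below.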

Attribute Information:

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)
    1 = the patient survived 5 years or longer
    2 = the patient died within 5 years

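We need to implement a decision tree in which every leaf has a single outcome (meaning the entropy of that leaf should be 0). However, six instances in the data set have identical attribute values but different outcomes.

For example: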

66,58,0,2
66,58,0,1

What does the C4.5 algorithm do in these cases? I have searched everywhere but cannot find any information.

Thanks.

Answer 1

Read Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993. (If this is for a university assignment, it is best to study C4.5 from the source.)

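From my research, it appears that on page 137, in the source listing of build.c, there is a comment:
/* if all cases are the same ... or there are not enough cases to divide */ (which is the situation in your question)
In that case it does return Node.
That node comes from
Node = Leaf(ClassFreq, BestClass, Cases, Cases-NoBestClass);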

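ClassFreq stores the count of each class
BestClass stores the dominant class (the one with the highest frequency)
Cases stores how many data items there are
NoBestClass stores how many data items belong to BestClass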

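The Leaf function comes from the file trees.c. It returns a node that is a leaf labelled with BestClass (the best class becomes the leaf). So instances with identical attributes but different outcomes end up together in one leaf that predicts the majority class, and the minority instances (Cases-NoBestClass) are counted as errors at that leaf. Its entropy is therefore not 0.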
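
To make the idea concrete, here is a minimal Python sketch of the majority-class leaf described above. It is not Quinlan's C code: the function name make_leaf and the dictionary layout are made up for illustration, and the variable names only mirror ClassFreq, BestClass, Cases and NoBestClass. Ties are broken by whichever class is seen first, which is an assumption of this sketch and not necessarily what C4.5 does.

```python
from collections import Counter


def make_leaf(labels):
    """Build a majority-class leaf for cases that cannot be split further.

    `labels` holds the class values of the cases that reached this node.
    """
    class_freq = Counter(labels)                  # role of ClassFreq
    # Most frequent class and its count (roles of BestClass, NoBestClass).
    best_class, no_best_class = class_freq.most_common(1)[0]
    cases = len(labels)                           # role of Cases
    return {
        "class": best_class,
        "cases": cases,
        "errors": cases - no_best_class,          # Cases - NoBestClass
        "class_freq": dict(class_freq),
    }


# The two conflicting rows from the question: 66,58,0 with classes 2 and 1.
leaf = make_leaf([2, 1])
print(leaf["cases"], leaf["errors"])
```

For the two rows in the question the leaf holds 2 cases and 1 of them is counted as an error, whichever class ends up as the label.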

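All of this information is taken from Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

If anyone knows this better and I got something wrong, please leave a comment. Thanks.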
