I've been reading about the ID3 algorithm recently, and it says that the best attribute to split on is the one that yields the maximum information gain, which can be computed with the help of entropy.
I wrote a simple Python program to compute entropy. It looks like this:
import math

def _E(p, n):
    x = p / (p + n)
    y = n / (p + n)
    # breaks when p or n is 0, because math.log2(0) is undefined
    return -x * math.log2(x) - y * math.log2(y)
But suppose we have a table of 10 elements, like this:
x = [1,0,1,0,0,0,0,0,0,0]
y = [1,1,1,0,1,0,1,0,1,0]
where x is the attribute and y is the class. Here P(0) = 0.8 and P(1) = 0.2. The entropy is:
Entropy(x) = 0.8 * _E(4,4) + 0.2 * _E(2,0)
However, the second partition, P(1), is perfectly classified, and this leads to a math error because log2(0) is negative infinity. How should entropy be computed in this case?
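For concreteness, calling the helper above on the pure partition reproduces the failure (note that in Python, math.log2(0) actually raises an exception rather than returning negative infinity):

>>> _E(2, 0)
Traceback (most recent call last):
  ...
ValueError: math domain error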
Answer 0 (score: 2)
Entropy is a measure of impurity, so if a node is pure, its entropy is zero.
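One common convention is to treat 0 * log2(0) as 0, so a pure partition simply contributes zero entropy. Here is a minimal sketch of an entropy function guarded that way (the name entropy_safe and the (p, n) signature are illustrative, mirroring the question's _E):

import math

def entropy_safe(p, n):
    """Binary entropy, using the convention 0 * log2(0) == 0."""
    total = p + n
    ent = 0.0
    for count in (p, n):
        if count > 0:  # skip zero-probability terms instead of calling log2(0)
            q = count / total
            ent -= q * math.log2(q)
    return ent

With this guard, entropy_safe(2, 0) returns 0.0, and the split from the question evaluates to 0.8 * entropy_safe(4, 4) + 0.2 * entropy_safe(2, 0) = 0.8 * 1.0 + 0.2 * 0.0 = 0.8.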
Take a look at this:
def information_gain(data, column, cut_point):
    """
    For calculating the goodness of a split: the difference between the
    entropy of the parent and the weighted entropy of the children.
    :params: the dataset as `data`, the attribute index as `column`,
             and the cut point as `cut_point`
    :returns: a tuple of (information gain, subset1, subset2)
    """
    subset1, subset2 = divide_data(data, column, cut_point)
    lensub1, lensub2 = len(subset1), len(subset2)
    # if either partition is empty, the split yields no gain
    if lensub1 == 0 or lensub2 == 0:
        return (0, subset1, subset2)
    weighted_ent = (lensub1 * entropy(subset1) + lensub2 * entropy(subset2)) / len(data)
    return ((entropy(data) - weighted_ent), subset1, subset2)
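The snippet assumes two helpers, divide_data and entropy, that are not shown. A minimal sketch of what they might look like, assuming each row of data is a tuple whose last element is the class label (the row layout and both function bodies are assumptions, not the answerer's actual code):

import math
from collections import Counter

def divide_data(data, column, cut_point):
    # split rows on whether the attribute at `column` is below the cut point
    subset1 = [row for row in data if row[column] < cut_point]
    subset2 = [row for row in data if row[column] >= cut_point]
    return subset1, subset2

def entropy(data):
    # entropy of the class labels (last element of each row); every observed
    # label has count > 0, so log2 is never called on zero
    total = len(data)
    counts = Counter(row[-1] for row in data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

With these in place, the question's data yields a gain of about 0.2, and the pure x = 1 partition no longer triggers a domain error:

data = list(zip([1,0,1,0,0,0,0,0,0,0], [1,1,1,0,1,0,1,0,1,0]))
gain, subset1, subset2 = information_gain(data, 0, 1)  # split on x at cut point 1; gain ≈ 0.2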
Answer 1 (score: 1)