Question

I am currently experimenting with a data set with 10 features and 3 classes, using supervised learning. A question arose: what feature selection algorithm should I use to find out which feature impacts which class the most, or which combination of features results in which class?

For example, take a data set of Hours Slept and Hours Studied, which may result in a pass or a fail.

I want to know how Hours Studied impacts the pass class and how it impacts the fail class, and likewise how Hours Slept impacts pass or fail.

What feature selection method will tell me that Hours Slept has x impact on Fail and y on Pass, and the same for Hours Studied?

Answer 1

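One approach is to observe how the entropy of the class label distribution changes after you partition the examples according to the values of a given attribute. The attribute that gives the largest entropy reduction is the "best" one. (This only works for discrete attributes; you will have to discretize the attributes in order to use this method, e.g. convert hoursSlept>7 to sleptAlot, 5<=hoursSlept<=7 to sleptEnough, and hoursSlept<5 to sleepDeprived.)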
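
As a minimal sketch of that discretization step: the bin boundaries and bin names below are the ones from the example above, while the function name `discretize_hours_slept` and the sample values are made up for illustration.

```python
def discretize_hours_slept(hours_slept: float) -> str:
    """Map a numeric hoursSlept value to one of the three example bins."""
    if hours_slept > 7:
        return "sleptAlot"
    if hours_slept >= 5:
        return "sleptEnough"
    return "sleepDeprived"


# Hypothetical sample values, not real data.
samples = [8.5, 7.0, 5.0, 4.5]
print([discretize_hours_slept(h) for h in samples])
```

Other binning schemes (equal-width, quantiles, and so on) would work just as well; the method only needs the attribute to take a small number of discrete values.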

The entropy H of a discrete distribution (p1,p2,...,pk) is defined as

H = -p1*log_2 p1 - p2*log_2 p2 - ... - pk*log_2 pk

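Roughly speaking, it measures the impurity of the distribution. The less you can tell about the outcome a priori, the higher the entropy is; the more you can tell about the outcome a priori, the lower the entropy is. In fact, the distribution with pi=1/k for all i (where all outcomes are equally likely) has the highest possible entropy (value log_2 k), and a distribution where pi=1 for some i has the lowest possible entropy (value 0).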
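
To make the two extremes concrete, here is a small sketch of the entropy formula in plain Python. The function name `entropy` and the two example distributions are my own choices for illustration; zero probabilities are skipped, following the usual convention that 0 * log 0 counts as 0.

```python
import math


def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Terms with p == 0 are skipped (convention: 0 * log 0 = 0).
    return sum(-p * math.log2(p) for p in probs if p > 0)


uniform = [1 / 3, 1 / 3, 1 / 3]  # all three outcomes equally likely
certain = [1.0, 0.0, 0.0]        # one outcome is certain

print(entropy(uniform), math.log2(3))  # the two values should agree up to rounding
print(entropy(certain))                # lowest possible entropy
```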

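Define pi=ni/n, where n is the number of examples and ni is the number of examples with the i-th class value. This induces a discrete distribution (p1,p2,...,pk), where k is the number of class values.

For an attribute A with possible values a1,a2,...,ar, define Si as the set of those examples whose value of attribute A equals ai. Each set Si induces a discrete distribution (defined in the same way as before). Let |Si| be the number of examples in the set Si. Denote the corresponding entropy by H(Si).

Now compute

Gain(A) = H - |S1|/n * H(S1) - ... - |Sr|/n * H(Sr)

and pick the attribute that maximizes Gain(A). The intuition is that the attribute that maximizes this difference partitions the examples so that in most Si the examples have similar labels (i.e. the entropy is low).

Intuitively, the value of Gain(A) tells you how informative the attribute A is about the class labels.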
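
The whole computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `label_entropy` and `information_gain`, the attribute values "high"/"low", and the six-row data set are all made up, and the attributes are assumed to be discretized already.

```python
import math
from collections import Counter, defaultdict


def label_entropy(labels):
    """Entropy (in bits) of the empirical class distribution pi = ni / n."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Gain(A) = H - |S1|/n * H(S1) - ... - |Sr|/n * H(Sr) for one discrete attribute."""
    n = len(labels)
    subsets = defaultdict(list)  # attribute value ai -> class labels of the examples in Si
    for value, label in zip(values, labels):
        subsets[value].append(label)
    weighted = sum(len(s) / n * label_entropy(s) for s in subsets.values())
    return label_entropy(labels) - weighted


# Made-up toy data: six students, two discretized attributes, one class label.
slept = ["sleptAlot", "sleptAlot", "sleptEnough", "sleptEnough",
         "sleepDeprived", "sleepDeprived"]
studied = ["high", "low", "high", "low", "high", "low"]
result = ["pass", "fail", "pass", "fail", "pass", "fail"]

print("Gain(hoursStudied):", information_gain(studied, result))
print("Gain(hoursSlept):  ", information_gain(slept, result))
```

In this toy data the studied attribute determines the result completely, so its gain equals the full class entropy H (1 bit), while every hoursSlept bin contains one pass and one fail, so its gain is 0.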

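For reference, this is widely used in decision tree learning, where the measure is known as information gain. See, for example, these slides; this explanation on Math.SE is really great (though it is in the context of decision tree learning).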
