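I am trying to use the information gain evaluator that Weka provides to rank attributes, but it does not work when the attribute that determines how an instance is classified does not always have a value. This is hard to explain, so here is an example:
This is my relation. It consists of three numeric attributes and a class attribute. Whenever attribute 3 takes the value 5000, the class is set to failing.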
@relation TEST
@attribute attr1 numeric
@attribute attr2 numeric
@attribute attr3 numeric
@attribute class {failing,correct}
@data
8, 0.519674, 5000, failing
?, 6.78149, ?, correct
?, 7.384081, 5000, failing
21, ?, ?, correct
5, 1.016151, 5000, failing
After running the information gain attribute evaluator, this is the output:
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Attribute Evaluator (supervised, Class (nominal): 4 class):
Information Gain Ranking Filter
Ranked attributes:
0.249 1 attr1
0 3 attr3
0 2 attr2
Selected attributes: 1,3,2 : 3
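Attribute 3 should be ranked highest, because its value determines whether an instance is classified as failing or correct. But that is not what happens.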
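To make that expectation concrete, here is a minimal sketch in plain Java that computes the information gain of attr3 by hand. It is not Weka's implementation: the class `InfoGainSketch` and its methods are made up for illustration, and attr3 is treated as a nominal attribute for simplicity. It compares two ways of handling missing values: dropping the rows where attr3 is missing, and counting "missing" as a category of its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of Weka.
class InfoGainSketch {

    /** Shannon entropy (base 2) of an array of class counts. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /**
     * Information gain of a nominal attribute with respect to a binary class.
     * values[i] is the attribute value of row i (null means missing);
     * failing[i] is true when row i has class "failing".
     * If missingAsValue is true, null is counted as its own category;
     * otherwise rows with a missing value are dropped before computing.
     */
    static double infoGain(String[] values, boolean[] failing, boolean missingAsValue) {
        Map<String, int[]> countsByValue = new HashMap<>();
        int[] classCounts = new int[2];
        int n = 0;
        for (int i = 0; i < values.length; i++) {
            String v = values[i];
            if (v == null) {
                if (!missingAsValue) continue;
                v = "<missing>";
            }
            int cls = failing[i] ? 0 : 1;
            countsByValue.computeIfAbsent(v, k -> new int[2])[cls]++;
            classCounts[cls]++;
            n++;
        }
        if (n == 0) return 0.0;
        // H(class | attribute): entropy within each value, weighted by its share of rows.
        double conditional = 0.0;
        for (int[] counts : countsByValue.values()) {
            int size = counts[0] + counts[1];
            conditional += ((double) size / n) * entropy(counts);
        }
        return entropy(classCounts) - conditional;
    }

    public static void main(String[] args) {
        // attr3 and class columns from the first data set above.
        String[] attr3 = {"5000", null, "5000", null, "5000"};
        boolean[] failing = {true, false, true, false, true};
        System.out.printf("missing rows dropped: %.3f%n", infoGain(attr3, failing, false));
        System.out.printf("missing as a value:   %.3f%n", infoGain(attr3, failing, true));
    }
}
```

With the missing rows dropped, only the three failing rows remain, so the gain is 0. With "missing" counted as its own value, the gain equals the full class entropy of a 3/2 split, about 0.971, which is the ranking I would expect for attribute 3.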
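So, my question: how can I tell Weka to take missing values into account when it computes the information gain?
One possibility is to replace the missing values with a constant, like this: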
@relation TEST
@attribute attr1 numeric
@attribute attr2 numeric
@attribute attr3 numeric
@attribute class {failing,correct}
@data
37, 9.295889, 5000, failing
48, ?, 0, correct
35, 14.722155, 5000, failing
?, 11.417347, 0, correct
?, 4.539502, 5000, failing
Then the ranking works:
=== Attribute Selection on all input data ===
Search Method:
Attribute ranking.
Attribute Evaluator (supervised, Class (nominal): 4 class):
Information Gain Ranking Filter
Ranked attributes:
0.971 3 attr3
0.249 1 attr1
0 2 attr2
Selected attributes: 3,1,2 : 3
But that is not what I want, because I cannot predict which values attribute 3 will take.
Here is my code:
import java.util.Random;

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

// Uses the older Weka API (FastVector, concrete Instance class).
public static void test() {
    FastVector attributes = new FastVector();
    Random rand = new Random();

    // Three numeric attributes.
    Attribute attr1 = new Attribute("attr1");
    Attribute attr2 = new Attribute("attr2");
    Attribute attr3 = new Attribute("attr3");
    attributes.addElement(attr1);
    attributes.addElement(attr2);
    attributes.addElement(attr3);

    // Nominal class attribute with the values "failing" and "correct".
    FastVector classValues = new FastVector(2);
    classValues.addElement("failing");
    classValues.addElement("correct");
    Attribute classAttribute = new Attribute("class", classValues);
    attributes.addElement(classAttribute);

    Instances instances = new Instances("TEST", attributes, 5);
    for (int i = 0; i < 5; i++) {
        // All four values start out as missing.
        Instance instance = new Instance(4);
        instance.setDataset(instances);

        // attr1 and attr2 are set at random, so some of them stay missing.
        if (i % (rand.nextInt(4) + 1) == 0)
            instance.setValue(attr1, rand.nextInt(50));
        if (i % (rand.nextInt(4) + 1) == 0)
            instance.setValue(attr2, rand.nextFloat() * 15);

        // attr3 is set only on failing instances; otherwise it stays missing.
        if (i % 2 == 0) {
            instance.setValue(attr3, 5000);
            instance.setValue(classAttribute, "failing");
        } else {
            //instance.setValue(attr3, 0);
            instance.setValue(classAttribute, "correct");
        }
        instances.add(instance);
    }
    instances.setClass(classAttribute);
    instances.compactify();
    System.out.println(instances);

    // Rank all attributes by information gain using the Ranker search method.
    try {
        System.out.println(AttributeSelection.SelectAttributes(
                new InfoGainAttributeEval(),
                new String[]{"-s", "weka.attributeSelection.Ranker"},
                instances));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Thanks!