How to handle missing attribute values in information gain attribute evaluation

Time: 2014-10-21 16:12:37

Tags: java weka information-theory

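I am trying to use the information gain library that Weka provides to evaluate selected attributes, but it does not work when the attribute that determines how an instance is classified does not always have a value. This is hard to explain, so here is an example: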

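Here is my relation. It consists of three numeric attributes and one class attribute. If attribute 3 takes the value 5000, the class is set to failing.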

@relation TEST

@attribute attr1 numeric
@attribute attr2 numeric
@attribute attr3 numeric
@attribute class {failing,correct}

@data
8,  0.519674, 5000, failing
?,  6.78149,  ?,    correct
?,  7.384081, 5000, failing
21, ?,        ?,    correct
5,  1.016151, 5000, failing

After running the information gain attribute evaluator, this is the output:

=== Attribute Selection on all input data ===

Search Method:
    Attribute ranking.

Attribute Evaluator (supervised, Class (nominal): 4 class):
    Information Gain Ranking Filter

Ranked attributes:
 0.249  1 attr1
 0      3 attr3
 0      2 attr2

Selected attributes: 1,3,2 : 3

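Now, attribute 3 should be ranked highest, because its value determines whether an instance is classified as failing or correct. But that is not the case.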
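To make the expected score concrete, here is a minimal plain-Java sketch of the information gain arithmetic for attr3 in the data above. It is an illustration only, not Weka's implementation; the class name `InfoGainSketch` and the labels `"?"` and `"all"` are made up for this example. One plausible reading of the zero score (an assumption, not something the output confirms) is that the missing values are not counted as a category of their own, so attr3 effectively has a single observed value and cannot split the data.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain-Java arithmetic, not Weka's implementation.
class InfoGainSketch {

    // Shannon entropy (base 2) of a class-count table with the given total.
    static double entropy(Map<String, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Information gain of a nominal attribute: H(class) - H(class | attribute).
    static double infoGain(String[] attrLabels, String[] classLabels) {
        int n = classLabels.length;
        Map<String, Integer> classCounts = new TreeMap<>();
        Map<String, Map<String, Integer>> perValue = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            classCounts.merge(classLabels[i], 1, Integer::sum);
            perValue.computeIfAbsent(attrLabels[i], k -> new TreeMap<>())
                    .merge(classLabels[i], 1, Integer::sum);
        }
        // Weighted entropy of the class within each attribute value.
        double conditional = 0.0;
        for (Map<String, Integer> counts : perValue.values()) {
            int size = 0;
            for (int c : counts.values()) {
                size += c;
            }
            conditional += ((double) size / n) * entropy(counts, size);
        }
        return entropy(classCounts, n) - conditional;
    }

    public static void main(String[] args) {
        String[] cls = {"failing", "correct", "failing", "correct", "failing"};
        // attr3 with "?" kept as a category of its own.
        String[] missingAsValue = {"5000", "?", "5000", "?", "5000"};
        // attr3 when every instance ends up in one and the same bin.
        String[] singleBin = {"all", "all", "all", "all", "all"};
        System.out.println(infoGain(missingAsValue, cls));
        System.out.println(infoGain(singleBin, cls));
    }
}
```

With missing kept as its own category, both groups are pure, so the gain equals the class entropy of a 3-to-2 split, roughly 0.971 bits. With a single bin the gain is 0, which matches the ranking shown above.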

So, my question: how can I tell Weka to take missing values into account when computing the information gain?

One possibility is to replace the missing values with a constant, like this:

@relation TEST

@attribute attr1 numeric
@attribute attr2 numeric
@attribute attr3 numeric
@attribute class {failing,correct}

@data

37, 9.295889,  5000, failing
48, ?,         0,    correct
35, 14.722155, 5000, failing
?,  11.417347, 0,    correct
?,  4.539502,  5000, failing

Then the ranking works:

=== Attribute Selection on all input data ===

Search Method:
    Attribute ranking.

Attribute Evaluator (supervised, Class (nominal): 4 class):
    Information Gain Ranking Filter

Ranked attributes:
 0.971  3 attr3
 0.249  1 attr1
 0      2 attr2

Selected attributes: 3,1,2 : 3

But that is not what I want, because I cannot predict the values my attribute 3 will take.
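One way to avoid picking a replacement constant might be to derive a separate nominal attribute that only records whether attr3 has a value. The sketch below is a hypothetical helper, not part of Weka; it assumes missing values are encoded as `Double.NaN`, and the labels `present` and `missing` are made up for this example. The resulting labels could be added to the data as an extra nominal attribute before running the evaluator. Separately, the InfoGainAttributeEval documentation lists a `-M` option described as treating missing values as a separate value; whether it helps for a numeric attribute such as attr3 is untested here.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Weka.
class PresenceIndicatorSketch {

    // Maps each numeric value to "present" or "missing"; NaN stands for a missing value here.
    static String[] presenceLabels(double[] values) {
        String[] labels = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            labels[i] = Double.isNaN(values[i]) ? "missing" : "present";
        }
        return labels;
    }

    public static void main(String[] args) {
        // attr3 column of the first data set above, with "?" encoded as NaN.
        double[] attr3 = {5000, Double.NaN, 5000, Double.NaN, 5000};
        // prints [present, missing, present, missing, present]
        System.out.println(Arrays.toString(presenceLabels(attr3)));
    }
}
```

Because the indicator does not depend on the actual numbers, it stays valid whatever values attr3 takes later.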

Here is my code:

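import java.util.Random;

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

    // Builds a 5-instance test set in which attr3 is only set for "failing" instances.
    public static void test() {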
        FastVector attributes = new FastVector();
        Random rand = new Random();

        Attribute attr1 = new Attribute("attr1");
        Attribute attr2 = new Attribute("attr2");
        Attribute attr3 = new Attribute("attr3");

        attributes.addElement(attr1);
        attributes.addElement(attr2);
        attributes.addElement(attr3);

        FastVector classValues = new FastVector(2);
        classValues.addElement("failing");
        classValues.addElement("correct");
        Attribute classAttribute = new Attribute("class", classValues);
        attributes.addElement(classAttribute);

        Instances instances = new Instances("TEST", attributes, 5);

        for (int i = 0; i < 5; i++) {
            Instance instance = new Instance(4);
            instance.setDataset(instances);

            if (i % (rand.nextInt(4) + 1) == 0)
                instance.setValue(attr1, rand.nextInt(50));

            if (i % (rand.nextInt(4) + 1) == 0)
                instance.setValue(attr2, rand.nextFloat() * 15);

            if (i % 2 == 0) {
                instance.setValue(attr3, 5000);
                instance.setValue(classAttribute, "failing");
            } else {
                // attr3 stays missing here; uncommenting the next line gives the constant-replacement variant.
                //instance.setValue(attr3, 0);
                instance.setValue(classAttribute, "correct");
            }

            instances.add(instance);
        }

        instances.setClass(classAttribute);
        instances.compactify();
        System.out.println(instances);

        try {
            System.out.println(AttributeSelection.SelectAttributes(new InfoGainAttributeEval(), new String[]{"-s", "weka.attributeSelection.Ranker"}, instances));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Thanks!

0 answers:

No answers