Does my code correctly compute the entropy / conditional entropy of a dataset?

Date: 2015-11-04 20:37:38

Tags: java machine-learning statistics artificial-intelligence probability

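I am writing a Java class that I want to use to compute things like entropy, joint entropy, and conditional entropy for a given dataset. The class in question is below: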

public class Entropy {

private Frequency<String> iFrequency = new Frequency<String>();
private Frequency<String> rFrequency = new Frequency<String>();

Entropy(){
    super();
}

public void setInterestedFrequency(List<String> interestedFrequency){
    for(String s: interestedFrequency){
        this.iFrequency.addValue(s);
    }
}

public void setReducingFrequency(List<String> reducingFrequency){
    for(String s:reducingFrequency){
        this.rFrequency.addValue(s);
    }
}

private double log(double num, int base){
   return Math.log(num)/Math.log(base);
}

public double entropy(List<String> data){

    double entropy = 0.0;
    double prob = 0.0;
    Frequency<String> frequency = new Frequency<String>();

    for(String s:data){
        frequency.addValue(s);
    }

    String[] keys = frequency.getKeys();

    for(int i=0;i<keys.length;i++){

        prob = frequency.getPct(keys[i]);
        entropy = entropy - prob * log(prob,2);
    }

    return entropy;
}

/*
* return conditional probability of P(interestedClass|reducingClass)
* */
public double conditionalProbability(List<String> interestedSet,
                                     List<String> reducingSet,
                                     String interestedClass,
                                     String reducingClass){
    List<Integer> conditionalData = new LinkedList<Integer>();

    if(iFrequency.getKeys().length==0){
        this.setInterestedFrequency(interestedSet);
    }

    if(rFrequency.getKeys().length==0){
        this.setReducingFrequency(reducingSet);
    }

    for(int i = 0;i<reducingSet.size();i++){
        if(reducingSet.get(i).equalsIgnoreCase(reducingClass)){
            if(interestedSet.get(i).equalsIgnoreCase(interestedClass)){
                conditionalData.add(i);
            }
        }
    }

    int numerator = conditionalData.size();
    int denominator = this.rFrequency.getNum(reducingClass);

    return (double)numerator/denominator;
}

public double jointEntropy(List<String> set1, List<String> set2){

    String[] set1Keys;
    String[] set2Keys;
    Double prob1;
    Double prob2;
    Double entropy = 0.0;

    if(this.iFrequency.getKeys().length==0){
        this.setInterestedFrequency(set1);
    }

    if(this.rFrequency.getKeys().length==0){
        this.setReducingFrequency(set2);
    }

    set1Keys = this.iFrequency.getKeys();
    set2Keys = this.rFrequency.getKeys();

    for(int i=0;i<set1Keys.length;i++){
        for(int j=0;j<set2Keys.length;j++){
            prob1 = iFrequency.getPct(set1Keys[i]);
            prob2 = rFrequency.getPct(set2Keys[j]);

            entropy = entropy - (prob1*prob2)*log((prob1*prob2),2);
        }
    }

    return entropy;
}

public double conditionalEntropy(List<String> interestedSet, List<String> reducingSet){

    double jointEntropy = jointEntropy(interestedSet,reducingSet);
    double reducingEntropyX = entropy(reducingSet);
    double conEntYgivenX = jointEntropy - reducingEntropyX;

    return conEntYgivenX;
}

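For the past few days I have been trying to figure out why my entropy calculation almost always comes out exactly the same as my conditional entropy calculation.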

I am using the following formulas:

H(X) = - Sigma from x = 1 to n of p(x) * log(p(x))

H(XY) = - Sigma from x = 1 to n, y = 1 to m of (p(x) * p(y)) * log(p(x) * p(y))

H(X|Y) = H(XY) - H(X)

My values for entropy and conditional entropy are almost identical.

With the dataset I am using for testing, I get the following values:

@Test
public void testEntropy(){
    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);

    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("lwt");

    Data discreteData = freshData.discretize(freshData.getData(),headersToChange,1,10);

    Entropy entropy = new Entropy();
    Double result = entropy.entropy(discreteData.getData().get("lwt"));
    assertEquals(2.48,result,.006);
}

@Test
public void testConditionalProbability(){

    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);

    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");


    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);

    Entropy entropy = new Entropy();
    double conditionalProb = entropy.conditionalProbability(discreteData.getData().get("lwt"),discreteData.getData().get("age"),"4","6");
    assertEquals(.1,conditionalProb,.005);
}

@Test
public void testJointEntropy(){


    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);

    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");

    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);

    Entropy entropy = new Entropy();
    double jointEntropy = entropy.jointEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age"));
    assertEquals(5.05,jointEntropy,.006);
}

@Test
public void testSpecifiedConditionalEntropy(){

    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);

    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");

    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);

    Entropy entropy = new Entropy();
    double specCondiEntropy = entropy.specifiedConditionalEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age"),"4","6");
    assertEquals(.332,specCondiEntropy,.005);

}

@Test
public void testConditionalEntropy(){

    FileHelper fileHelper = new FileHelper();
    List<String> lines = fileHelper.readFileToMemory("");
    Data freshData = fileHelper.parseCSVData(lines);

    LinkedList<String> headersToChange = new LinkedList<String>();
    headersToChange.add("age");
    headersToChange.add("lwt");

    Data discreteData = freshData.discretize(freshData.getData(), headersToChange, 1, 10);

    Entropy entropy = new Entropy();
    Double result = entropy.conditionalEntropy(discreteData.getData().get("lwt"),discreteData.getData().get("age"));
    assertEquals(2.47,result,.006);
}

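Everything compiles correctly, but I am fairly sure my conditional entropy calculation is wrong. I just cannot tell where I made the mistake.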

The values in the unit tests are the values I currently get. They are the same as the output of the functions above.

At one point I also tested with the following:

List<String> survived = Arrays.asList("1","0","1","1","0","1","0","0","0","1","0","1","0","0","1");
List<String> sex = Arrays.asList("0","1","0","1","1","0","0","1","1","0","1","0","0","1","1");

Here male = 1 and survived = 1. I then used these lists to compute

double result = entropy.entropy(survived);
assertEquals(.996,result,.006);

as well as

double jointEntropy = entropy.jointEntropy(survived,sex);
assertEquals(1.99,jointEntropy,.006);

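I also checked my work by computing the values by hand (the original post linked to a picture of that work). Because my code gave me the same values as the hand calculation, and because the other functions are very simple and only use the entropy and joint entropy functions, I assumed everything was fine.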

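However, something is wrong. Below are two more functions I wrote, to compute the information gain and the symmetrical uncertainty of a set.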

public double informationGain(List<String> interestedSet, List<String> reducingSet){
    double entropy = entropy(interestedSet);
    double conditionalEntropy = conditionalEntropy(interestedSet,reducingSet);
    double infoGain = entropy - conditionalEntropy;
    return infoGain;
}

public double symmetricalUncertainty(List<String> interestedSet, List<String> reducingSet){
    double infoGain = informationGain(interestedSet,reducingSet);
    double intSet = entropy(interestedSet);
    double redSet = entropy(reducingSet);
    double symUnc = 2 * ( infoGain/ (intSet+redSet) );
    return symUnc;
}

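The original survived/sex set gave me a slightly negative answer. Since it was only about -.000000000000002, I assumed it was a rounding error. But when I ran my program on real data, none of the values I got for symmetrical uncertainty made any sense.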

2 answers:

Answer 0: (score: 1)

tl;dr: your calculation of H(X,Y) evidently assumes that X and Y are independent. That gives H(X,Y) = H(X) + H(Y), which in turn makes your H(X|Y) equal to H(X).

Is that your problem? If so, use the correct formula for the joint entropy of X and Y (taken from Wikipedia):

H(X,Y) = - Sigma over x, Sigma over y of P(x,y) * log2(P(x,y))

You arrive at your incorrect version by substituting P(X,Y) = P(X)P(Y), that is, by assuming the two variables are independent.
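To spell out that substitution step, here is a short derivation. It uses p(x) and p(y) for the marginal probabilities, as in the question, and only assumes that each marginal sums to 1:

```latex
\begin{aligned}
H_{\text{indep}}(X,Y)
  &= -\sum_{x}\sum_{y} p(x)\,p(y)\,\log\bigl(p(x)\,p(y)\bigr) \\
  &= -\sum_{x}\sum_{y} p(x)\,p(y)\,\bigl[\log p(x) + \log p(y)\bigr] \\
  &= -\Bigl(\sum_{y} p(y)\Bigr)\sum_{x} p(x)\log p(x)
     \;-\; \Bigl(\sum_{x} p(x)\Bigr)\sum_{y} p(y)\log p(y) \\
  &= H(X) + H(Y)
\end{aligned}
```

The question's conditionalEntropy method then subtracts entropy(reducingSet) from this sum, which leaves exactly the entropy of the interested set. The information gain therefore comes out as zero up to floating-point rounding, which would explain the tiny negative value reported in the question.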

If the two variables really are independent, then H(X|Y) = H(X) does hold: Y tells you nothing about X, so knowing Y can't reduce the entropy of X.
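As a minimal sketch of what the correct formula looks like in code, the joint distribution can be estimated by counting how often each (x, y) pair occurs at the same index, instead of multiplying the two marginals. The class name PairedEntropy and its methods are hypothetical and not part of the question's Entropy class. A plain HashMap stands in for the question's Frequency type, whose API is not shown in full. The sketch also assumes that the values never contain the NUL character used as a pair separator.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original Entropy class.
class PairedEntropy {

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Shannon entropy in bits of an empirical distribution given as counts.
    private static double entropyOfCounts(Map<String, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / total;
            h -= p * log2(p);
        }
        return h;
    }

    // H(X): count each value on its own.
    static double entropy(List<String> xs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String x : xs) {
            counts.merge(x, 1, Integer::sum);
        }
        return entropyOfCounts(counts, xs.size());
    }

    // H(X,Y): count each (x, y) pair observed at the same index.
    static double jointEntropy(List<String> xs, List<String> ys) {
        if (xs.size() != ys.size()) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < xs.size(); i++) {
            // Assumption: '\u0000' never appears inside a value.
            counts.merge(xs.get(i) + "\u0000" + ys.get(i), 1, Integer::sum);
        }
        return entropyOfCounts(counts, xs.size());
    }

    // H(X|Y) = H(X,Y) - H(Y)
    static double conditionalEntropy(List<String> xs, List<String> ys) {
        return jointEntropy(xs, ys) - entropy(ys);
    }

    public static void main(String[] args) {
        // The survived/sex lists from the question.
        List<String> survived = Arrays.asList("1","0","1","1","0","1","0","0","0","1","0","1","0","0","1");
        List<String> sex = Arrays.asList("0","1","0","1","1","0","0","1","1","0","1","0","0","1","1");
        System.out.println("H(survived)       = " + entropy(survived));
        System.out.println("H(survived, sex)  = " + jointEntropy(survived, sex));
        System.out.println("H(survived | sex) = " + conditionalEntropy(survived, sex));
    }
}
```

On the survived/sex lists from the question, counting pairs gives a joint entropy of roughly 1.83 bits, not the 1.99 produced under the independence assumption. The conditional entropy then comes out strictly below H(survived), so the information gain is no longer zero.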

Answer 1: (score: 0)

To compute the entropy of a single vector, you can use the following function:

Function<List<Double>, Double> entropy = 
    x-> {
        double sum= x.stream().mapToDouble(Double::doubleValue).sum();
        return - x.stream()
                    .map(y->y/sum)
                    .map(y->y*Math.log(y))
                    .mapToDouble(Double::doubleValue)
                    .sum();
    };

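As an example, the vector [1 2 3] gives a result of 1.0114. This value is in nats, because the function uses the natural logarithm.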

double H = new Entropy().entropy.apply(Arrays.asList(new Double[] { 1.0, 2.0, 3.0 }));
System.out.println("Entropy H = "+ H);
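The function above uses Math.log, so its result is in nats, whereas the question's code takes logarithms in base 2 and so reports bits. A minimal sketch of the same computation in bits follows. The names EntropyBits and entropyBits are hypothetical, and the sketch assumes non-negative weights with a positive sum:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: entropy in bits of a vector of non-negative weights.
class EntropyBits {

    static double entropyBits(List<Double> weights) {
        double sum = weights.stream().mapToDouble(Double::doubleValue).sum();
        return -weights.stream()
                .mapToDouble(w -> w / sum)                 // normalise to probabilities
                .filter(p -> p > 0.0)                      // treat 0 * log(0) as 0
                .map(p -> p * Math.log(p) / Math.log(2))   // change of base to log2
                .sum();
    }

    public static void main(String[] args) {
        double h = entropyBits(Arrays.asList(1.0, 2.0, 3.0));
        System.out.println("Entropy H (bits) = " + h);
    }
}
```

Dividing the 1.0114 nats from the example by ln 2 gives about 1.459 bits, which is what this version should return for [1 2 3].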