Entropy of English dictionary

时间:2016-04-04 16:50:53

标签: java algorithm dictionary trie entropy

I have a trie data structure that stores a sequence of English words. For example, given these words, the dictionary is this:

aa abc aids aimed ami amo b browne brownfield brownie browser brut
butcher casa cash cicca ciccio cicelies cicero cigar ciste conv cony
crumply diarca diarchial dort eserine excursus foul gawkishly he
insomniac mehuman occluding poverty pseud rumina skip slagging
socklessness unensured waspily yes yoga z zoo

The nodes in blue are those in which a word is finished.

enter image description here

In each node I saved:

  • the character that it represents
  • the level at which the node is located
  • a counter that indicates how many words "pass" for that node
  • the entropy of the next character of the node.

I would find the entropy for each level of the tree and the total entropy of the dictionary.

This is a piece of class TrieNode that rapresent a single node:

class TrieNode {

    public char content; 
    public boolean isEnd; 
    public double count; 
    public LinkedList<TrieNode> childList; 
    public String path = "";
    public double entropyNextChar;
    public int level;

    ...
 }

And this is a piece of class Trie with some methods to manipulate the trie data structure:

 public class Trie {

    ...

    private double totWords = 0; 
    public double totChar = 0; 

    public double[] levelArray; //in each i position of the array there is the entropy of level i 
    public ArrayList<ArrayList<Double>> level; //contains a list of entropies for each level

    private TrieNode root;

    public Trie(String filenameIn, String filenameOut) {
        root = new TrieNode('*'); //blank for root 
        getRoot().level = 0;
        totWords = 0;
    }

    public double getTotWords() {
        return totWords;
    }

    /** 
     * Function to insert word, setta per ogni nodo il path e il livello.
     */
    public void insert(String word) {
        if(search(word) == true) { 
            return;
        }
        int lev = 0;
        totChar += word.length(); 
        TrieNode current = root;
        current.level = getRoot().level;
        for(char ch : word.toCharArray()) { 
            TrieNode child = current.subNode(ch); 
            if(child != null) {
                child.level = current.level + 1;
                current = child;
            }
            else {
                current.childList.add(new TrieNode(ch, current.path, current.level + 1));
                current = current.subNode(ch);
            }
            current.count++;
        }
        totWords++;
        getRoot().count = totWords;
        current.isEnd = true; 
    }

    /**
     * Function to search for word.
     */
    public boolean search(String word) {
        ...
    }

    /**
     * Set the entropy of each node.
     */
    public void entropyNextChar(TrieNode node) {
        for(TrieNode childToCalculate : node.childList) {
            int numberChildren = node.childList.size();
            int i = 0; 
            double entropy = 0.0;
            if(numberChildren > 0) {
                double[] p = new double[numberChildren]; 
                for(TrieNode child : node.childList) { 
                    p[i] = child.count / node.count;
                    i++;
                }
                for(int j = 0; j < p.length; j++) {
                    entropy -= p[j] * log2(p[j]);
                }
                node.entropyNextChar = entropy;
                entropyNextChar(childToCalculate);
            }
        }
    }

    /**
     * Return the number of levels (root has lev = 0).
     */
    public int getLevels(TrieNode node) {
        int lev = 0;
        if(node != null) {
            TrieNode current = node;
            for(TrieNode child : node.childList) {
                lev = Math.max(lev, 1 + getLevels(child));
            }      
        }   
        return lev;
    }

    public static double log2(double n) {
        return (Math.log(n) / Math.log(2));
    }

    ...
} 

Now I would find the entropy of each level of the tree.

To do this I created the following method that creates two data structures (level and levelArray). levelArray is an array of double containing the result, that is, in the index i there is the entropy of the i-level. level is a arrayList of arrayList. Each list contains the weighted entropy of each node.

public void entropyEachLevel() {
    int num_levels = getLevels(getRoot());
    levelArray = new double[num_levels + 1];
    level = new ArrayList<ArrayList<Double>>(num_levels + 1);
    for(int i = 0; i < num_levels + 1; i++) {
        level.add(new ArrayList<Double>()); 
        levelArray[i] = 0; //inizializzo l'array
    }
    entropyNextChar(getRoot()); 
    fillListArray(getRoot(), level);
    for(int i = 1; i < levelArray.length; i++) {
        for(Double el : level.get(i)) {
            levelArray[i] += el; 
        }
    }
}

public void fillListArray(TrieNode node, ArrayList<ArrayList<Double>> level) {
    for(TrieNode child : node.childList) { 
        double val = child.entropyNextChar * (node.count / totWords);
        level.get(child.level).add(val);
        fillListArray(child, level);
    }
}

If I run this method on the sample dictionary I get:

[lev 1]  10.355154029112995
[lev 2]  0.6557748405420764
[lev 3]  0.2127659574468085
[lev 4]  0.23925771271992619
[lev 5]  0.17744361708265158
[lev 6]  0.0
[lev 7]  0.0
[lev 8]  0.0
[lev 9]  0.0
[lev 10] 0.0
[lev 11] 0.0
[lev 12] 0.0

The problem is that I don't understand if the result is likely or completely wrong.

Another problem: I would like to calculate the entropy of the entire dictionary. To do so I thought I'd add the values present in levelArray. It's rigth a procedure like that? If I do that I obtain that the entropy of the entire dictionary is 11.64.

I need some advice. No wonder code, but I would understand if the solutions I proposed to solve the two problems are corrected.

The example I proposed is very simple. In reality these methods must work on a real English dictionary of about 200800 words. And if I apply these methods in this dictionary I get numbers like these (in my opinion are excessive).

Entropy for each level:

[lev 1]  65.30073504641602
[lev 2]  44.49825655981045
[lev 3]  37.812193162250765
[lev 4]  18.24599038562219
[lev 5]  7.943507700803994
[lev 6]  4.076715421729149
[lev 7]  1.5934893456776191
[lev 8]  0.7510203704630074
[lev 9]  0.33204345165280974
[lev 10] 0.18290941591943546
[lev 11] 0.10260282173581108
[lev 12] 0.056284946780556455
[lev 13] 0.030038717136269627
[lev 14] 0.014766733727532396
[lev 15] 0.007198162552512713
[lev 16] 0.003420610593927708
[lev 17] 0.0013019239303215001
[lev 18] 5.352246905990619E-4
[lev 19] 2.1483959981088307E-4
[lev 20] 8.270156797847352E-5
[lev 21] 7.327868866691726E-5
[lev 22] 2.848394217759738E-6
[lev 23] 6.6648152186416716E-6
[lev 24] 0.0
[lev 25] 8.545182653279214E-6
[lev 26] 0.0
[lev 27] 0.0
[lev 28] 0.0
[lev 29] 0.0
[lev 30] 0.0
[lev 31] 0.0

I think they are wrong. And the entropy of the entire dictionary can not be calculated, I think that the sum it's a process too long so as I'm not able to see the result.

For this I would understand if the methods that I have written are right.

Thanks a lot in advance


In case of dictionary is: aaaab and abcd, I have:

enter image description here

So the counter of node * is 2 because for that node pass two words. The same for the node a at level 1.


I changed method entropyNextChar in this way:

public void entropyNextChar(TrieNode node) {
    for(TrieNode childToCalculate : node.childList) {
        int numberChildren = node.childList.size();
        int i = 0; 
        double entropy = 0.0;
        if(numberChildren > 1) {
            double[] p = new double[numberChildren];
            for(TrieNode child : node.childList) { 
                p[i] = child.count / node.count;
                i++;
            }
            for(int j = 0; j < p.length; j++) {
                entropy -= p[j] * log2(p[j]);
            }
            node.entropyNextChar = entropy;
            entropyNextChar(childToCalculate);
        }
        else {
            node.entropyNextChar = entropy;
            entropyNextChar(childToCalculate);
        }
    }
}

So I add the else.

1 个答案:

答案 0 :(得分:0)

我不清楚你想要得到什么样的熵(单词?,字符?,节点?),所以我假设你的意思是单词的熵,你提到的字典可能包含重复。

如果你想计算trie及其等级中单词的熵,你的概率函数看起来并不合适。考虑一下你的entryNextChar函数:

public void entropyNextChar(TrieNode node) {
    for(TrieNode childToCalculate : node.childList) {
        int numberChildren = node.childList.size();
        int i = 0; 
        double entropy = 0.0;
        if(numberChildren > 0) {
            double[] p = new double[numberChildren]; 
            for(TrieNode child : node.childList) { 
                p[i] = child.count / node.count;
                i++;
            }
            for(int j = 0; j < p.length; j++) {
                entropy -= p[j] * log2(p[j]);
            }
            node.entropyNextChar = entropy;
            entropyNextChar(childToCalculate);
        }
    }
}

特别是我对以下行感兴趣

p[i] = child.count / node.count;

看起来你计算了节点命中的概率(即插入从最顶层字符开始的单词时命中节点的可能性)。如果这就是你真正想要的那么我会说你做对了,虽然这对我没有多大意义。

否则,如果你想计算每个级别内的单词熵,那么你的概率函数是错误的。级别中每个单词的概率应该类似于 count(word_i)/ total_words 其中 count(word_i)是每个单词在级别内发生的次数和 total_words 是级别所持有的单词总数。你有 totWords 计数器,但你错过了单个单词的计数器。我建议您仅在将 isEnd 设置为true时递增 count 字段。在这种情况下,它将指示每个单词在该级别内出现的次数。你的概率函数看起来像:

p[i] = child.count / totWords; // make sure totWords is not 0
  

另一个问题:我想计算整本字典的熵。为此,我想我会添加levelArray中的值。这是一个类似的程序吗?如果我这样做,我会得到整个字典的熵是11.64。

我不认为这是一种正确的方法,因为你的概率分布会发生变化。这次你需要计算整个trie中的单词总数,并将每个单词的频率计数器除以这个数字,以获得正确的概率。