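I have built a document classifier following the MALLET example at http://mallet.cs.umass.edu/classifier-devel.php
What I want to do next is get the most influential features for each class. I'm sure this is simple, but I haven't been able to find out how to do it in Java.
Any help is appreciated.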
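Answer 0 (score: 0)
I ran into the same problem, and this worked for me. (It is not fully self-contained: it assumes, for example, that you already have a trained classifier and some test data.)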
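import java.io.File;
import java.io.FileReader;
import java.io.PrintWriter;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.types.Alphabet;
import cc.mallet.types.InstanceList;
import cc.mallet.types.LabelAlphabet;
import cc.mallet.types.PerLabelInfoGain;

// Assumes `classifier` is an already trained cc.mallet.classify.Classifier.
PrintWriter debugOut = new PrintWriter(new File(<filePath>));
// Read the test data through the same pipe the classifier was trained with.
InstanceList testInstances = new InstanceList(classifier.getInstancePipe());
CsvIterator reader = new CsvIterator(new FileReader(<path_to_testdata>), "(\\w+)\\s+(\\w+)\\s+(.*)", 3, 2, 1); // (data, label, name) field indices
testInstances.addThruPipe(reader);
// Information gain of every feature, computed separately for each label.
PerLabelInfoGain plig = new PerLabelInfoGain(testInstances);
Alphabet alpha = classifier.getAlphabet();
LabelAlphabet la = classifier.getLabelAlphabet();
debugOut.println("debugging label numbers: " + la.size());
for (int q = 0; q < la.size(); q++) {
    debugOut.println("Class: " + la.lookupLabel(q));
    // Print the 10 highest-ranked features for label q.
    for (int j = 0; j < 10; j++) {
        int alphaId = plig.getInfoGain(q).getIndexAtRank(j);
        Object feature = alpha.lookupObject(alphaId);
        debugOut.println(j + "\t" + plig.getInfoGain(q).getValueAtRank(j) + "\t" + feature);
    }
    debugOut.println("===============");
}
debugOut.close();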
This produces:
debugging label numbers: 3
Class: sexism
0 0.1257616291393775 sexist
1 0.1257616291393775 rt
2 0.1257616291393775 female
3 0.1257616291393775 notsexist
4 0.1257616291393775 m
5 0.1257616291393775 women
6 0.1257616291393775 mt8_9
7 0.1257616291393775 sports
8 0.1257616291393775 islam
9 0.1257616291393775 men
===============
Class: none
0 0.09383300761779656 sexist
1 0.09383300761779656 mkr
2 0.09383300761779656 female
3 0.09383300761779656 muslims
4 0.09383300761779656 rt
5 0.09383300761779656 notsexist
6 0.09383300761779656 women
7 0.09383300761779656 islam
8 0.09383300761779656 mt8_9
9 0.09383300761779656 mohammed
===============
Class: racism
0 0.062072998255453926 islam
1 0.062072998255453926 muslims
2 0.062072998255453926 mkr
3 0.062072998255453926 mohammed
4 0.062072998255453926 muslim
5 0.062072998255453926 maxblumenthal
6 0.062072998255453926 quran
7 0.062072998255453926 years
8 0.062072998255453926 prophet
9 0.062072998255453926 1400
===============
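For anyone wondering what the numbers in the middle column stand for: as far as I understand it, a per-label information gain asks how much knowing whether a feature occurs in a document reduces the uncertainty about "this class vs. everything else". Below is a rough plain-Java sketch of that idea, computed from four document counts. The class name `InfoGainSketch`, the method names, and the counts are all made up for illustration; this is not MALLET's code and I have not checked that it matches PerLabelInfoGain digit for digit.

```java
import java.util.Locale;

// Hypothetical illustration, not part of MALLET.
class InfoGainSketch {

    // Entropy in bits of a two-outcome distribution given by raw counts.
    static double entropy(double a, double b) {
        double total = a + b;
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (double c : new double[] {a, b}) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // n11: feature present, document in the class
    // n10: feature present, document not in the class
    // n01: feature absent,  document in the class
    // n00: feature absent,  document not in the class
    static double infoGain(int n11, int n10, int n01, int n00) {
        double total = n11 + n10 + n01 + n00;
        if (total == 0) {
            return 0.0;
        }
        // Uncertainty about the label before looking at the feature.
        double labelEntropy = entropy(n11 + n01, n10 + n00);
        // Expected uncertainty left once the feature is known.
        double present = n11 + n10;
        double absent = n01 + n00;
        double conditionalEntropy = (present / total) * entropy(n11, n10)
                + (absent / total) * entropy(n01, n00);
        return labelEntropy - conditionalEntropy;
    }

    public static void main(String[] args) {
        // Feature that occurs in exactly the documents of the class.
        System.out.printf(Locale.ROOT, "%.4f%n", infoGain(5, 0, 0, 5));
        // Feature that is equally common inside and outside the class.
        System.out.printf(Locale.ROOT, "%.4f%n", infoGain(3, 3, 2, 2));
    }
}
```

A feature that perfectly separates the class from the rest gets a gain equal to the full label entropy, and a feature spread evenly across both gets a gain of zero. That is why a high score means "informative about this class", not necessarily "typical of this class".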
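Edit: the value lookup should of course use the inner loop index, i.e. plig.getInfoGain(q).getValueAtRank(j) rather than plig.getInfoGain(q).getValueAtRank(i). The sample output was produced with i, which is why every row within a class shows the same value.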
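To make the rank-based lookup easier to follow, here is a small plain-Java sketch of what I take getIndexAtRank and getValueAtRank to do conceptually: sort the feature indices by score, highest first, and then read both the index and the value at the same rank. The class `RankedScoresSketch`, its method names, and the scores are invented for this example and are not MALLET API.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration, not part of MALLET.
class RankedScoresSketch {
    private final double[] scores;
    private final Integer[] order; // order[r] = feature index at rank r

    RankedScoresSketch(double[] scores) {
        this.scores = scores.clone();
        this.order = new Integer[scores.length];
        for (int k = 0; k < order.length; k++) {
            order[k] = k;
        }
        // Sort feature indices by descending score.
        Arrays.sort(order, (a, b) -> Double.compare(this.scores[b], this.scores[a]));
    }

    int indexAtRank(int rank) {
        return order[rank];
    }

    double valueAtRank(int rank) {
        return scores[order[rank]];
    }

    public static void main(String[] args) {
        // Made-up feature names and scores, only for demonstration.
        String[] features = {"islam", "rt", "women", "sports"};
        RankedScoresSketch ranked =
                new RankedScoresSketch(new double[] {0.02, 0.11, 0.07, 0.01});
        for (int j = 0; j < features.length; j++) {
            // Use the same rank j for both lookups so name and value match.
            System.out.printf(Locale.ROOT, "%d\t%.2f\t%s%n",
                    j, ranked.valueAtRank(j), features[ranked.indexAtRank(j)]);
        }
    }
}
```

Using one rank for the name and a different, fixed rank for the value is exactly the kind of mismatch that makes a whole column print the same number.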