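As part of learning the Stanford NLP API for classification, I am training a naive Bayes classifier on a very simple training set (3 labels => ['happy', 'sad', 'neutral']). The training data set is: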
happy happy
happy glad
sad gloomy
neutral fine
This is part of the output from training the classifier (before the error):
numDatumsPerLabel: {happy=2.0, sad=1.0, neutral=1.0}
numLabels: 3 [happy, sad, neutral]
numFeatures (Phi(X) types): 4 [1-SW-happy, 1-SW-glad, 1-SW-gloomy, 1-SW-fine]
I get an array index out of bounds error. The stack trace is below. I cannot find the problem.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeightsJL(NaiveBayesClassifierFactory.java:171)
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainWeights(NaiveBayesClassifierFactory.java:146)
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:84)
at edu.stanford.nlp.classify.NaiveBayesClassifierFactory.trainClassifier(NaiveBayesClassifierFactory.java:352)
at edu.stanford.nlp.classify.ColumnDataClassifier.makeClassifier(ColumnDataClassifier.java:1458)
at edu.stanford.nlp.classify.ColumnDataClassifier.trainClassifier(ColumnDataClassifier.java:2091)
at edu.stanford.nlp.classify.demo.ClassifierDemo.main(ClassifierDemo.java:35)
The error occurs while the weights are being computed, in this method:
private NBWeights trainWeightsJL(int[][] data, int[] labels, int numFeatures, int numClasses) {
  int[] numValues = numberValues(data, numFeatures);
  double[] priors = new double[numClasses];
  double[][][] weights = new double[numClasses][numFeatures][];
  //init weights array
  for (int cl = 0; cl < numClasses; cl++) {
    for (int fno = 0; fno < numFeatures; fno++) {
      weights[cl][fno] = new double[numValues[fno]];
      // weights[cl][fno] = new double[numFeatures];
    }
  }
  for (int i = 0; i < data.length; i++) {
    priors[labels[i]]++;
    for (int fno = 0; fno < numFeatures; fno++) {
      weights[labels[i]][fno][data[i][fno]]++;
    }
  }
  for (int cl = 0; cl < numClasses; cl++) {
    for (int fno = 0; fno < numFeatures; fno++) {
      for (int val = 0; val < numValues[fno]; val++) {
        weights[cl][fno][val] = Math.log((weights[cl][fno][val] + alphaFeature) / (priors[cl] + alphaFeature * numValues[fno]));
      }
    }
    priors[cl] = Math.log((priors[cl] + alphaClass) / (data.length + alphaClass * numClasses));
  }
  return new NBWeights(priors, weights);
}
I cannot understand what the line
int[] numValues = numberValues(data, numFeatures);
means. The error comes from the line
weights[labels[i]][fno][data[i][fno]]++;
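I originally thought weights would be a two-dimensional array that tracks how often each feature (fno) occurs for each class (label). I am not sure why a third dimension is needed.
Any help is much appreciated.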
Answer 0 (score: 0)
I did not have any problems with these properties:
#
# Features
#
useClassFeature=true
1.useNGrams=true
1.usePrefixSuffixNGrams=true
1.maxNGramLeng=4
1.minNGramLeng=1
1.binnedLengths=10,20,30
#
# Printing
#
# printClassifier=HighWeight
printClassifierParam=200
#
# Mapping
#
goldAnswerColumn=0
displayedColumn=1
#
# Optimization
#
intern=true
sigma=3
useQN=true
QNsize=15
tolerance=1e-4
useNB=true
useClass=true
#
# Training input
#
trainFile=simple-classifier-training-set.txt
serializeTo=model.txt
Running this command:
java -Xmx8g edu.stanford.nlp.classify.ColumnDataClassifier -prop example.prop
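On the question about the third dimension, here is a minimal sketch, not the library's actual code. In the quoted method the array is indexed as weights[class][feature][value], so the last dimension holds one count per value that a feature can take. The class NaiveBayesWeightsSketch and its numberValues method are hypothetical stand-ins: I am assuming numberValues returns, for each feature, the largest observed value plus one. The sketch also shows that the counting loop reads data[i][fno] for every fno, so a row with fewer than numFeatures entries fails with an out-of-bounds index. Whether that is what happens inside ColumnDataClassifier with this training set is a guess I have not verified.

```java
import java.util.Arrays;

public class NaiveBayesWeightsSketch {

  // Hypothetical stand-in for numberValues (an assumption, not the library code):
  // for each feature, the number of values it can take, taken here to be
  // the largest value seen in that column plus one.
  static int[] numberValues(int[][] data, int numFeatures) {
    int[] numValues = new int[numFeatures];
    for (int[] row : data) {
      for (int fno = 0; fno < row.length; fno++) {
        numValues[fno] = Math.max(numValues[fno], row[fno] + 1);
      }
    }
    return numValues;
  }

  // Counts how often feature fno takes value val among the examples of class cl.
  // The result is indexed as weights[cl][fno][val], as in the quoted method.
  static double[][][] countWeights(int[][] data, int[] labels, int numFeatures, int numClasses) {
    int[] numValues = numberValues(data, numFeatures);
    double[][][] weights = new double[numClasses][numFeatures][];
    for (int cl = 0; cl < numClasses; cl++) {
      for (int fno = 0; fno < numFeatures; fno++) {
        weights[cl][fno] = new double[numValues[fno]];
      }
    }
    for (int i = 0; i < data.length; i++) {
      for (int fno = 0; fno < numFeatures; fno++) {
        // Needs one entry per feature in every row.
        weights[labels[i]][fno][data[i][fno]]++;
      }
    }
    return weights;
  }

  public static void main(String[] args) {
    int[] labels = {0, 0, 1, 2}; // happy=0, sad=1, neutral=2

    // Dense rows: one 0/1 entry per feature, in the order
    // [1-SW-happy, 1-SW-glad, 1-SW-gloomy, 1-SW-fine].
    int[][] dense = {
      {1, 0, 0, 0}, // happy   happy
      {0, 1, 0, 0}, // happy   glad
      {0, 0, 1, 0}, // sad     gloomy
      {0, 0, 0, 1}  // neutral fine
    };
    double[][][] weights = countWeights(dense, labels, 4, 3);
    // Class happy, feature 1-SW-happy: absent once (the "glad" row), present once.
    System.out.println(Arrays.toString(weights[0][0]));

    // Sparse rows: each row lists only the index of its active feature.
    int[][] sparse = {{0}, {1}, {2}, {3}};
    try {
      countWeights(sparse, labels, 4, 3);
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("sparse rows fail in the counting loop: " + e.getMessage());
    }
  }
}
```

With the dense rows every feature has two possible values (0 and 1), so each weights[cl][fno] has length 2. That is the role of the third dimension: it separates "feature absent" from "feature present" counts, and would allow more values for features that are not binary.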