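I am using the ClearTK (v. 2.0) simple pipeline to develop a binary classifier for individual sentences in a CAS. However, even though the training data is generated, the classifier does not pick it up during training; see below.
I am working from this example, in particular this snippet: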
AnalysisEngineFactory.createPrimitiveDescription(
<name-of-your-cleartk-annotator>.class,
CleartkAnnotator.PARAM_IS_TRAINING, true,
DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY,
<your-output-directory-file>,
DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME,
<name-of-your-selected-classifier's-data-writer>.class);
So my initialization code looks like this:
AnalysisEngine trainClassifier = AnalysisEngineFactory.createPrimitive(MyClassifier.class,
    CleartkAnnotator.PARAM_IS_TRAINING, true,
    DirectoryDataWriterFactory.PARAM_OUTPUT_DIRECTORY, "target/classifier-data/",
    DefaultSequenceDataWriterFactory.PARAM_DATA_WRITER_CLASS_NAME, MalletCrfStringOutcomeDataWriter.class.getName());
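When I run my pipeline, the data is created and stored in target/classifier-data/training-data.malletcrf, where each line is a feature vector whose entries have the format <featurename>_<value>, followed by my Boolean target attribute. I can open the file in a text editor and inspect it.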
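I am using a String-outcome classifier because my target-variable annotator inherits from CleartkSequenceAnnotator and, as I understood from an earlier answer on the ClearTK list, there does not seem to be a Boolean classifier that can handle multiple classification tasks per CAS.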
My classifier code, roughly:
public class MyClassifier extends CleartkSequenceAnnotator<String> {
    @Override
    public void process(JCas jCas) throws AnalysisEngineProcessException {
        // retrieve sentences in the cas
        for (Sentence sentence : sentences) {
            // apply feature extractors here to add features
            // add target variable
        }
        if (this.isTraining()) {
            // write the features and outcomes as training instances
            this.dataWriter.write(Instances.toInstances(targets, featureLists));
            try {
                System.out.println("training the classifier ... ");
                Train.main("target/classifier-data/");
                System.out.println("done training classifier");
            } catch (Exception e) {
                System.out.println("ERROR while training the classifier.");
                e.printStackTrace();
            }
        } else /* Classification */ {...}
    }
}
Here is the pipeline code:
SimplePipeline.runPipeline(reader,
    trainClassifier,
    XmiWriter);
When I run the pipeline, I get the following console output, even though the training data has been written:
... reader initialization ...
Couldn't open cc.mallet.util.MalletLogger resources/logging.properties file.
Perhaps the 'resources' directories weren't copied into the 'class' directory.
Continuing.
starting pipeline
training the classifier ...
Okt 02, 2014 11:19:48 PM cc.mallet.fst.SimpleTagger main
INFORMATION: Number of features in training data: 0
Okt 02, 2014 11:19:48 PM cc.mallet.fst.SimpleTagger main
INFORMATION: Number of predicates: 0
Okt 02, 2014 11:19:48 PM cc.mallet.fst.SimpleTagger main
INFORMATION: Labels: O
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRF addOrderNStates
INFORMATION: Preparing O
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRF addOrderNStates
INFORMATION: O->O(O) O,O
State #0 "O"
initialWeight=0.0, finalWeight=0.0
#destinations=1
-> O
Okt 02, 2014 11:19:48 PM cc.mallet.fst.SimpleTagger train
INFORMATION: Training on 0 instances
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRF setWeightsDimensionAsIn
INFORMATION: CRF weights[O,O] num features = 0
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRF setWeightsDimensionAsIn
INFORMATION: Number of weights = 1
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRFTrainerByLabelLikelihood train
INFORMATION: CRF about to train with 1 iterations
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRFOptimizableByLabelLikelihood getValue
INFORMATION: getValue() (loglikelihood, optimizable by label likelihood) = 0.0
Okt 02, 2014 11:19:48 PM cc.mallet.optimize.LimitedMemoryBFGS optimize
INFORMATION: L-BFGS initial gradient is zero; saying converged
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRFTrainerByLabelLikelihood train
INFORMATION: CRF finished one iteration of maximizer, i=0
Okt 02, 2014 11:19:48 PM cc.mallet.fst.CRFTrainerByLabelLikelihood train
INFORMATION: CRF training has converged, i=0
done training classifier
... which tells me that the classifier does not pick up the training data from the file.
What am I doing wrong? Thanks in advance!
Answer 0 (score: 0)
My guess is that you imported the wrong Sentence class. You can easily find out whether I am right by debugging the for loop in the process method of MyClassifier.
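If stepping through with a debugger is inconvenient, a rough alternative is to count what is actually in the training file at the moment training starts. The sketch below uses only the Java standard library; `TrainingDataCheck` and `countInstances` are hypothetical names of my own, not part of ClearTK or Mallet, and it assumes the file holds one instance per non-blank line, as described in the question. It is a diagnostic aid, not a fix.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not a ClearTK or Mallet class): reports how many
// instance lines a training-data file contains.
public class TrainingDataCheck {

    // Counts non-blank lines; returns 0 if the file does not exist.
    // Assumption: one training instance per non-blank line.
    public static long countInstances(Path file) throws IOException {
        if (!Files.isRegularFile(file)) {
            return 0;
        }
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> !line.trim().isEmpty()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the question; pass another path as the first argument.
        Path file = Paths.get(args.length > 0
                ? args[0]
                : "target/classifier-data/training-data.malletcrf");
        System.out.println(file + ": " + countInstances(file) + " instance line(s)");
    }
}
```

Calling `TrainingDataCheck.countInstances(...)` right before `Train.main(...)` and again after the pipeline has finished should show whether the loop produced any instances and whether they were already on disk when training ran. A count of 0 at training time would match the "Training on 0 instances" line in your log.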