Question

I am trying to train a custom NER model for multiple entities. Here is the sample training data:

count all <START:item_type> operating tables <END> on the <START:location_id> third <END> <START:location_type> floor <END>
count all <START:item_type> items <END> on the <START:location_id> third <END> <START:location_type> floor <END>
how many <START:item_type> beds <END> are in <START:location_type> room <END> <START:location_id> 2 <END>

The NameFinderME.train(.) method takes a String parameter type. What is this parameter used for? And how can I train a model for multiple entities (e.g. item_type, location_type, location_id in my case)?

public static void main(String[] args) {
    String trainingDataFile = "/home/OpenNLPTest/lib/training_data.txt";
    String outputModelFile = "/tmp/model.bin";
    String sentence = "how many beds are in the hospital";

    train(trainingDataFile, outputModelFile, "location_type");
    predict(sentence, outputModelFile);
}

private static void train(String trainingDataFile, String outputModelFile, String tagToFind) {
    File inFile = new File(trainingDataFile);
    NameSampleDataStream nss = null;
    try {
        nss = new NameSampleDataStream(new PlainTextByLineStream(new java.io.FileReader(inFile)));
    } catch (Exception e) {}

    TokenNameFinderModel model = null;
    int iterations = 100;
    int cutoff = 5;
    try {
        // Does the 'type' parameter mean the entity type that I am trying to train the model for?
        // What if I need to train for multiple entities?
        model = NameFinderME.train("en", tagToFind, nss, (AdaptiveFeatureGenerator) null, Collections.<String,Object>emptyMap(), iterations, cutoff); 
    } catch(Exception e) {}

    try {
        File outFile = new File(outputModelFile);           
        FileOutputStream outFileStream = new FileOutputStream(outFile);
        model.serialize(outFileStream);
    }
    catch (Exception ex) {}
}

private static void predict(String sentence, String modelFile) throws Exception {
    FileInputStream modelInToken = new FileInputStream("/tmp/en-token.bin");
    TokenizerModel modelToken = new TokenizerModel(modelInToken);
    Tokenizer tokenizer = new TokenizerME(modelToken); 
    String tokens[] = tokenizer.tokenize(sentence);

    FileInputStream modelIn = new FileInputStream(modelFile);

    TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
    NameFinderME nameFinder = new NameFinderME(model);
    Span nameSpans[] = nameFinder.find(tokens);

    double[] spanProbs = nameFinder.probs(nameSpans);

    for( int i = 0; i<nameSpans.length; i++) {
        System.out.println(nameSpans[i]);
    }

}

Answer 1

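The type argument of NameFinderME.train is used as the default type for training data that does not contain a type parameter. This is only relevant if you have a sample that looks like this:

<START> operating tables <END>

instead of like this:

<START:item_type> operating tables <END>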
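
To make the fallback rule concrete, here is a minimal sketch in plain Java of how an untyped tag could pick up a default type while typed tags keep their own. It is only an illustration of the rule described above, not OpenNLP's parser; the class DefaultTypeDemo and the method typedSpans are hypothetical names, and the exact behavior of the type argument should be checked against the Javadoc of the OpenNLP version in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class DefaultTypeDemo {

    // Matches both "<START> text <END>" and "<START:some_type> text <END>".
    // Group 1 is the optional type, group 2 is the annotated text.
    private static final Pattern SPAN =
            Pattern.compile("<START(?::([^>\\s]+))?>\\s+(.*?)\\s+<END>");

    /** Returns "type=text" for each span, using defaultType when the tag has no type. */
    static List<String> typedSpans(String line, String defaultType) {
        List<String> result = new ArrayList<>();
        Matcher m = SPAN.matcher(line);
        while (m.find()) {
            String type = m.group(1) != null ? m.group(1) : defaultType;
            result.add(type + "=" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Untyped tag: the default type is applied.
        System.out.println(typedSpans("count all <START> operating tables <END>", "item_type"));
        // Typed tag: the type in the data is kept and the default is ignored.
        System.out.println(typedSpans("count all <START:item_type> operating tables <END>", "location_type"));
    }
}
```

Since every tag in the question's training data already carries a type, the default would never be applied to that data under this rule.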

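As for training multiple types of entities, the developer documentation says:

The training file can contain multiple types. If the training file contains multiple types the created model will also be able to detect these multiple types. For now it is recommended to only train single type models, since multi type support is still experimental.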

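So you can try training on the sample in your question, which includes multiple types, and see how well it works. In this mailing list message, someone asks about the status of training on multiple types and gets this answer: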

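The code path itself is stable, the reason we put that there was that it didn't perform well on English data.

Anyway, the performance might depend heavily on your data set and language.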

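If you do not get good performance from a model that handles multiple types, you can instead create multiple copies of the training data, with each copy modified to contain only one type. You would then train a separate model on each set of training data. At that point you should have (for example) an item_type model, a location_type model and a location_id model. You can then run your input through each model to detect the different types.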
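
Producing the per-type copies can be automated. The following is a minimal sketch in plain Java, assuming the training data uses the <START:type> ... <END> format from the question with whitespace around the tags; the class SingleTypeTrainingData and the method keepOnly are hypothetical names and are not part of OpenNLP. For a given type it keeps that type's annotations and unwraps all others to plain tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class SingleTypeTrainingData {

    // Matches one typed span such as "<START:item_type> operating tables <END>".
    // Group 1 is the entity type, group 2 is the annotated text.
    private static final Pattern SPAN =
            Pattern.compile("<START:([^>\\s]+)>\\s+(.*?)\\s+<END>");

    /** Keeps annotations of typeToKeep and replaces every other annotation with its bare text. */
    static String keepOnly(String line, String typeToKeep) {
        Matcher m = SPAN.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1).equals(typeToKeep)
                    ? m.group(0)   // keep the whole tagged span
                    : m.group(2);  // drop the tags, keep the tokens
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Applies keepOnly to every line of a training file that was read into memory. */
    static List<String> keepOnly(List<String> lines, String typeToKeep) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            result.add(keepOnly(line, typeToKeep));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "how many <START:item_type> beds <END> are in "
                + "<START:location_type> room <END> <START:location_id> 2 <END>";
        for (String type : new String[] {"item_type", "location_type", "location_id"}) {
            System.out.println(type + ": " + keepOnly(line, type));
        }
    }
}
```

Each filtered copy can then be written to its own file and passed to the train method from the question to produce one model per type.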

OpenNLP: Training a custom NER model for multiple entities
