Question

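I'm new to NLP, and I'm using the Stanford NER tool to classify some random text in order to extract special keywords used in software programming.

The problem is that I don't know how to change the classifiers and text annotators in Stanford NER so that they recognize software programming keywords. For example: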

today Java used in different operating systems (Windows, Linux, ..)

The classification result should look like:

Java "Programming_Language"
Windows "Operating_System"
Linux "Operating_System"

Could you help me understand how to customize the Stanford NER classifier to meet my needs?

Answer 1

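I think this is well documented in the Stanford NER FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a

Here are the steps:

In your properties file, change the map to specify how your training data is annotated (or structured):

map = word=0,myfeature=1,answer=2

In src\edu\stanford\nlp\sequences\SeqClassifierFlags.java, add a flag stating that you want to use your new feature; let's call it useMyFeature. Below public boolean useLabelSource = false, add public boolean useMyFeature = true;

In the same file, in the setProperties(Properties props, boolean printProps) method, after else if (key.equalsIgnoreCase("useTrainLexicon")) { ..}, tell the tool whether this flag is on or off for you: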
```
else if (key.equalsIgnoreCase("useMyFeature")) {
      useMyFeature= Boolean.parseBoolean(val);
}
```

In src/edu/stanford/nlp/ling/CoreAnnotations.java, add the following section:

public static class myfeature implements CoreAnnotation<String> {
  public Class<String> getType() {
    return String.class;
  }
}

In src/edu/stanford/nlp/ling/AnnotationLookup.java, at the bottom of public enum KeyLookup{..}, add:

MY_TAG(CoreAnnotations.myfeature.class, "myfeature")

In src\edu\stanford\nlp\ie\NERFeatureFactory.java, depending on the "type" of feature it is, add it in the matching method, for example:

protected Collection<String> featuresC(PaddedList<IN> cInfo, int loc)

if (flags.useMyFeature) {
    featuresC.add(c.get(CoreAnnotations.myfeature.class)+"-my_tag");
}

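Debugging: in addition to all this, there are methods that dump the features to a file; use them to see how things are being done under the hood. Also, I think you will need to spend some time with the debugger too :P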
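To make the first step concrete: with a map of word=0, myfeature=1, answer=2, each training line carries three tab-separated columns. The sketch below prints such lines for the sentence from the question. The middle-column values (LANG_HINT, OS_HINT, NONE) are hypothetical placeholders for whatever extra feature you supply, and the O background label is an assumption based on Stanford NER's default background symbol.

```java
/** Builds three-column training lines for a map of word=0, myfeature=1, answer=2. */
final class ThreeColumnTrainingData {

    /** Joins each {word, myfeature, answer} row with tabs, one row per line. */
    static String format(String[][] rows) {
        StringBuilder out = new StringBuilder();
        for (String[] row : rows) {
            out.append(String.join("\t", row)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The middle column holds made-up feature values, for illustration only.
        String[][] sentence = {
            {"Java", "LANG_HINT", "Programming_Language"},
            {"used", "NONE", "O"},
            {"in", "NONE", "O"},
            {"Windows", "OS_HINT", "Operating_System"},
            {"Linux", "OS_HINT", "Operating_System"},
        };
        System.out.print(format(sentence));
    }
}
```

The column order has to match the indices in the map, so the word comes first, the custom feature second, and the gold label last.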

Answer 2

It seems you want to train your own custom NER model.

Here is a detailed tutorial with complete code:

https://dataturks.com/blog/stanford-core-nlp-ner-training-java-example.php?s=so

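Training data format

The training data is passed as a text file in which each line is one word-label pair. Each word should be labeled in a format like "word\tLABEL", where the word and the label name are separated by a tab '\t'. A text sentence should be broken down into words, with one line added to the training file for each word. To mark the start of the next sentence, add an empty line to the training file.

Here is a sample of the input training file: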

hp  Brand
spectre ModelName
x360    ModelName

home    Category
theater Category
system  0

horizon ModelName
zero    ModelName
dawn    ModelName
ps4 0
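A file in this format can be generated from already-labeled tokens. The following is a minimal sketch using only the Java standard library; the class and method names are illustrative and not part of Stanford NER. It writes one "word<TAB>LABEL" line per token and an empty line between sentences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Renders labeled sentences as word<TAB>LABEL lines, with a blank line between sentences. */
final class TrainingFileWriter {

    /** Each sentence is a list of {word, label} pairs. */
    static String render(List<List<String[]>> sentences) {
        List<String> blocks = new ArrayList<>();
        for (List<String[]> sentence : sentences) {
            StringBuilder block = new StringBuilder();
            for (String[] pair : sentence) {
                block.append(pair[0]).append('\t').append(pair[1]).append('\n');
            }
            blocks.add(block.toString());
        }
        // Joining the blocks with a newline leaves one empty line between sentences.
        return String.join("\n", blocks);
    }

    public static void main(String[] args) {
        List<String[]> first = Arrays.asList(
            new String[] {"hp", "Brand"},
            new String[] {"spectre", "ModelName"},
            new String[] {"x360", "ModelName"});
        List<String[]> second = Arrays.asList(
            new String[] {"home", "Category"},
            new String[] {"theater", "Category"},
            new String[] {"system", "0"});
        System.out.print(render(Arrays.asList(first, second)));
    }
}
```

Redirect the output to a file, or write the returned string with java.nio.file.Files.write, and pass that path as the training file.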

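Depending on your domain, you can build such a dataset either automatically or manually. Building it by hand can be quite painful; tools such as an NER annotation tool can make the process much easier.

Train the model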

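import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.sequences.SeqClassifierFlags;
import edu.stanford.nlp.util.StringUtils;

public void trainAndWrite(String modelOutPath, String prop, String trainingFilepath) {
   // Load the feature and format settings from the properties file.
   Properties props = StringUtils.propFileToProperties(prop);
   props.setProperty("serializeTo", modelOutPath);

   // If a training file is passed in, use it; otherwise use the one from the properties file.
   if (trainingFilepath != null) {
       props.setProperty("trainFile", trainingFilepath);
   }

   SeqClassifierFlags flags = new SeqClassifierFlags(props);
   CRFClassifier<CoreLabel> crf = new CRFClassifier<>(flags);
   crf.train();

   // Save the trained model to disk.
   crf.serializeClassifier(modelOutPath);
}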
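The prop argument of trainAndWrite points to a properties file that is not shown here. As a rough sketch, the snippet below prints one possible set of properties: the feature flags are the ones used in the example properties file of the Stanford NER FAQ, and map = word=0,answer=1 assumes the two-column training format shown earlier. Treat the selection as a starting point, not a tuned configuration; the class name is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Prints a starting-point properties file for training a two-column NER model. */
final class NerPropertiesSketch {

    static Map<String, String> settings() {
        Map<String, String> p = new LinkedHashMap<>();
        // Column 0 is the word, column 1 is the gold label.
        p.put("map", "word=0,answer=1");
        // Feature flags taken from the example in the Stanford NER FAQ.
        p.put("useClassFeature", "true");
        p.put("useWord", "true");
        p.put("useNGrams", "true");
        p.put("noMidNGrams", "true");
        p.put("maxNGramLeng", "6");
        p.put("usePrev", "true");
        p.put("useNext", "true");
        p.put("useSequences", "true");
        p.put("usePrevSequences", "true");
        p.put("maxLeft", "1");
        p.put("useTypeSeqs", "true");
        p.put("useTypeSeqs2", "true");
        p.put("useTypeySequences", "true");
        p.put("wordShape", "chris2useLC");
        p.put("useDisjunctive", "true");
        return p;
    }

    /** Renders the settings as "key = value" lines. */
    static String render() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : settings().entrySet()) {
            out.append(e.getKey()).append(" = ").append(e.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render());
    }
}
```

trainFile and serializeTo are left out on purpose, because trainAndWrite sets serializeTo itself and sets trainFile whenever a training path is passed in.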

Use the model to generate tags:

public void doTagging(CRFClassifier<CoreLabel> model, String input) {
    input = input.trim();
    System.out.println(input + "=>"  +  model.classifyToString(input));
}
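classifyToString returns the tagged text as a single string. Assuming it comes back in the slash-separated form word/LABEL, with O as the background label, a small helper can turn it into the word "Label" listing that the question asks for. The class and method names below are illustrative, and the sample string is hand-written rather than real model output.

```java
import java.util.ArrayList;
import java.util.List;

/** Extracts labeled words from slash-tagged text such as "Java/Programming_Language used/O". */
final class SlashTagParser {

    /** Returns entries like: Java "Programming_Language", skipping background tokens. */
    static List<String> entities(String tagged, String background) {
        List<String> result = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Split at the last slash so words that contain '/' keep their prefix.
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // no word/label separator; ignore this token
            }
            String word = token.substring(0, slash);
            String label = token.substring(slash + 1);
            if (!label.equals(background)) {
                result.add(word + " \"" + label + "\"");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample in the assumed output format.
        String tagged = "today/O Java/Programming_Language used/O in/O "
                + "Windows/Operating_System Linux/Operating_System";
        for (String line : entities(tagged, "O")) {
            System.out.println(line);
        }
    }
}
```

This helper treats every token separately; multi-word entities would need adjacent tokens with the same label to be merged.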

Hope this helps.

Stanford NER customization for classifying software programming keywords

2 answers: