Question

I adapted Professor Manning's code example from here to read a file, tokenize it, POS-tag it, and lemmatize it.

Now I have run into the problem of untokenizable characters, and I would like to use the "untokenizable" option and set it to "noneKeep".

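Other questions on StackOverflow explain that I would need to instantiate the tokenizer myself. However, I don't know how to do that in a way that still runs the subsequent steps (POS tagging, etc.) as required. Can anyone point me in the right direction?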

// expects two command line parameters: one file to be read, one to write to

import java.io.*;
import java.util.*;

import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;

public class StanfordCoreNlpDemo {

  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));

    pipeline.annotate(annotation);

    // try-with-resources closes the writer, so the output is flushed to the file
    try (PrintWriter out = new PrintWriter(args[1])) {
      pipeline.prettyPrint(annotation, out);
    }
  }
}

Answer 1

Add this to your code:

props.setProperty("tokenize.options", "untokenizable=allKeep");

The six possible values for untokenizable are:

noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep
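
Each value combines two choices: whether a warning is logged for none, only the first, or all untokenizable characters, and whether those characters are deleted or kept as single-character tokens. The default is firstDelete. For the noneKeep setting asked about in the question, a minimal sketch could look like the following. The class UntokenizableProps and its build method are hypothetical helpers made up for illustration, not part of CoreNLP; only the "annotators" and "tokenize.options" property keys come from the code above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper (not part of CoreNLP): builds the pipeline properties
// and rejects values that are not one of the six listed options.
final class UntokenizableProps {

  // The six values listed above.
  static final List<String> POLICIES = Arrays.asList(
      "noneDelete", "firstDelete", "allDelete",
      "noneKeep", "firstKeep", "allKeep");

  static Properties build(String policy) {
    if (!POLICIES.contains(policy)) {
      throw new IllegalArgumentException("Unknown untokenizable policy: " + policy);
    }
    Properties props = new Properties();
    // Same annotators as in the question.
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    // Forwarded to the tokenizer by the pipeline.
    props.setProperty("tokenize.options", "untokenizable=" + policy);
    return props;
  }

  public static void main(String[] args) {
    Properties props = build("noneKeep");
    System.out.println(props.getProperty("tokenize.options"));
  }
}
```

The resulting Properties object can be passed to new StanfordCoreNLP(props) exactly as in the question, so the tokenizer does not have to be instantiated by hand and the later annotators (ssplit, pos, lemma) still run.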
