I adapted Professor Manning's code example from here to read in a file, tokenize it, part-of-speech tag it, and lemmatize it.
Now I am running into problems with untokenizable characters, and I would like to use the "untokenizable" option and set it to "noneKeep".
Other questions on StackOverflow explain that I would need to instantiate the tokenizer myself. However, I don't know how to do that so that the tasks listed above (POS tagging etc.) are still performed as needed. Can anyone point me in the right direction?
// expects two command line parameters: one file to be read, one to write to
import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;

public class StanfordCoreNlpDemo {
    public static void main(String[] args) throws IOException {
        PrintWriter out = new PrintWriter(args[1]);
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation annotation = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
        pipeline.annotate(annotation);
        pipeline.prettyPrint(annotation, out);
    }
}
Answer (score: 4)
Add this to your code:
props.setProperty("tokenize.options", "untokenizable=allKeep");
The six settings for untokenizable are:
noneDelete, firstDelete, allDelete, noneKeep, firstKeep, allKeep
The first part (none/first/all) controls how many warnings are logged for untokenizable characters; the second part (Delete/Keep) controls whether such characters are dropped or kept as single-character tokens.
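For reference, the one-line fix slots into the question's existing setup like this. This is a minimal sketch: no custom tokenizer needs to be instantiated, since the pipeline reads the `tokenize.options` property when it builds its own tokenizer. The pipeline construction itself is commented out here so the snippet compiles without the CoreNLP jar on the classpath.

```java
import java.util.Properties;

public class TokenizeOptionsDemo {
    public static void main(String[] args) {
        // Same pipeline configuration as in the question...
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        // ...plus the fix from the answer: keep untokenizable
        // characters and log no warnings about them.
        props.setProperty("tokenize.options", "untokenizable=noneKeep");

        // With Stanford CoreNLP on the classpath you would then continue as before:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        //   pipeline.annotate(annotation);
        //   pipeline.prettyPrint(annotation, out);

        System.out.println(props.getProperty("tokenize.options"));
    }
}
```

The rest of the original `main` method stays unchanged; only the extra `setProperty` call before constructing the pipeline is new.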