Question

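I'm having a problem: I'm trying to use Stanford CoreNLP to recognize NUMBER named entities. If the text contains, for example, "20 million", the result comes back like this: "Number": ["20-5", "million-6"]. How can I improve the output so that "20 million" is returned together as one entry? And how can I get rid of the index numbers (5, 6) shown in the example above? I am using Java.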

    public void extractNumbers(String text) throws IOException {
        number = new HashMap<String, ArrayList<String>>();
        n = new ArrayList<String>();
        edu.stanford.nlp.pipeline.Annotation document = new edu.stanford.nlp.pipeline.Annotation(text);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // Skip tokens that are not part of any named entity
                if (!token.get(CoreAnnotations.NamedEntityTagAnnotation.class).equals("O")) {
                    // Keep only tokens tagged NUMBER
                    if (token.get(CoreAnnotations.NamedEntityTagAnnotation.class).equals("NUMBER")) {
                        n.add(token.toString()); // produces values such as "20-5"
                        number.put("Number", n);
                    }
                }
            }
        }
    }

Answer 1

To get the exact text from any object of the CoreLabel class, simply use token.originalText() instead of token.toString().

If you need anything else from these tokens, take a look at the CoreLabel documentation.
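
To also get "20 million" back as a single entry, one option is to join consecutive tokens that carry the NUMBER tag. The following is only a minimal sketch of that idea in plain Java, not part of CoreNLP: TaggedToken and mergeNumberTokens are hypothetical names, and TaggedToken stands in for a CoreLabel so the example runs without the library. In the loop from the question, the text would presumably come from token.originalText() and the tag from CoreAnnotations.NamedEntityTagAnnotation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NumberEntityMerger {

    /** Hypothetical stand-in for a CoreLabel: surface text plus its NER tag. */
    public static final class TaggedToken {
        final String text;
        final String nerTag;

        public TaggedToken(String text, String nerTag) {
            this.text = text;
            this.nerTag = nerTag;
        }
    }

    /** Joins each run of adjacent NUMBER tokens into one space-separated string. */
    public static List<String> mergeNumberTokens(List<TaggedToken> tokens) {
        List<String> merged = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (TaggedToken token : tokens) {
            if ("NUMBER".equals(token.nerTag)) {
                // Extend the current run of number tokens
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(token.text);
            } else if (current.length() > 0) {
                // A non-number token ends the current run
                merged.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the input
        if (current.length() > 0) {
            merged.add(current.toString());
        }
        return merged;
    }

    public static void main(String[] args) {
        // Tags are written by hand here; in practice they come from the pipeline
        List<TaggedToken> tokens = Arrays.asList(
                new TaggedToken("About", "O"),
                new TaggedToken("20", "NUMBER"),
                new TaggedToken("million", "NUMBER"),
                new TaggedToken("people", "O"));
        System.out.println(mergeNumberTokens(tokens)); // prints [20 million]
    }
}
```

Because the merge relies only on adjacency, two separate numbers that happen to sit next to each other would also be joined; whether that matters depends on the input text.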

Stanford number named entity recognition
