Question

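I am using Stanford CoreNLP for extraction. Below is the sentence from which I am trying to extract the currency amount and its currency symbol:

5 March 2015 Kering Issue of €500,000,000 0.875 per cent

The data I need to extract is €500,000,000 0.875

By default, NLP gives the sentence as

5 March 2015 Kering Issue of $500,000,000 0.875 per cent

So I wrote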

public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
            "normalizeCurrency=false");
DocumentPreprocessor docPre = new DocumentPreprocessor(new java.io.StringReader(textChunk));
docPre.setTokenizerFactory(TokenizerFactory);

Now the sentence correctly comes out as

5 March 2015 Kering Issue of €500,000,000 0.875 per cent

But when I run

props.put("annotators", "tokenize, cleanxml, ssplit, pos, lemma, ner, regexner");
props.setProperty("ner.useSUTime", "0");
_pipeline = new StanfordCoreNLP(props);
Annotation document = new Annotation(text);
_pipeline.annotate(document);

where text = 5 March 2015 Kering Issue of €500,000,000 0.875 per cent

my output is

<token id="9">
   <word>$</word>
   <lemma></lemma>
   <CharacterOffsetBegin>48</CharacterOffsetBegin>
   <CharacterOffsetEnd>49</CharacterOffsetEnd>
   <POS>CD</POS>
   <NER>MONEY</NER>
   <NormalizedNER>$5.000000000875E9</NormalizedNER>
</token>
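
A quick arithmetic check on the NormalizedNER value above may help when reading it: 5.000000000875E9 is 5,000,000,000.875, which is numerically what results when the digit strings "500000000" and "0.875" are joined and parsed as a single number. Whether CoreNLP's normalizer actually combines the two adjacent tokens this way is an assumption drawn only from this output; the class and method names below are hypothetical and the sketch uses plain Java, not CoreNLP.

```java
class NormalizedValueCheck {

  // Hypothetical helper: strip grouping commas from the amount, append the
  // rate's text directly, and parse the joined string as one double.
  static double joined(String amount, String rate) {
    return Double.parseDouble(amount.replace(",", "") + rate);
  }

  public static void main(String[] args) {
    // Prints the joined value in Java's default double formatting,
    // for comparison with the NormalizedNER field.
    System.out.println(joined("500,000,000", "0.875"));
  }
}
```

If the printed value matches the NormalizedNER field, the amount and the coupon rate were most likely treated as one MONEY entity rather than two separate values.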

So I added the line props.put("tokenize.options", "normalizeCurrency=false"); but the output is still the same, $5.000000000875E9.

Can anyone help me? Thanks.

Answer 1

When I run this code, it does not change the currency symbol to "$":

package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;

import java.util.*;

public class TokenizeOptionsExample {

  public static void main(String[] args) {
    Annotation document = new Annotation("5 March 2015 Kering Issue of €500,000,000 0.875 per cent");
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.options", "normalizeCurrency=false");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);
    for (CoreLabel token : document.get(CoreAnnotations.TokensAnnotation.class)) {
      System.out.println(token);
    }
  }
}
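
As a cross-check that does not depend on CoreNLP at all, the target string from the question (€500,000,000 0.875) can also be pulled from the raw text with a regular expression. This is only a minimal sketch: the class name, method name, and the pattern itself are assumptions made for illustration, and the pattern only covers a single currency symbol followed by a comma-grouped amount and an optional decimal rate.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CurrencyAmountSketch {

  // Assumed pattern: one currency symbol (Unicode category Sc), a
  // comma-grouped integer amount, then an optional whitespace-separated
  // decimal number such as a coupon rate.
  private static final Pattern AMOUNT =
      Pattern.compile("\\p{Sc}\\d{1,3}(?:,\\d{3})*(?:\\s+\\d+\\.\\d+)?");

  // Returns the first match in the text, or an empty Optional if none.
  static Optional<String> extract(String text) {
    Matcher m = AMOUNT.matcher(text);
    return m.find() ? Optional.of(m.group()) : Optional.empty();
  }

  public static void main(String[] args) {
    String text = "5 March 2015 Kering Issue of €500,000,000 0.875 per cent";
    System.out.println(extract(text).orElse("(no match)"));
  }
}
```

Comparing this raw-text match with the tokens printed by the pipeline shows whether the symbol was altered during tokenization or at a later annotation step.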
