When using the Stanford parser's TokenizerFactory, I made sure to set the option "untokenizable=noneDelete", but I still get warnings. What could be the problem?
public static List<Tree> findHeadNounPhrases(List<String> unites)
{
    List<Tree> nps = new ArrayList<Tree>();
    for (String sentence : unites)
    {
        HeadFinder hf = new PennTreebankLanguagePack().headFinder();
        StringReader reader = new StringReader(sentence);
        TokenizerFactory<CoreLabel> tokenizerFactory =
            PTBTokenizer.factory(new CoreLabelTokenFactory(), "untokenizable=noneDelete");
        tokenizerFactory.setOptions("untokenizable=noneDelete");
        Tokenizer<CoreLabel> tok = tokenizerFactory.getTokenizer(reader);
        List<CoreLabel> rawWords2 = tok.tokenize();
        Tree tree = lp.apply(rawWords2);
        ...
    }
I get the following warnings:
Mar 10, 2016 11:13:51 AM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ି (U+B3F, decimal: 2879)
Mar 10, 2016 11:13:51 AM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: ି (U+B3F, decimal: 2879)
Mar 10, 2016 11:13:56 AM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: (U+89, decimal: 137)
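
For anyone hitting the same warnings: independently of the tokenizer options, one possible workaround is to strip the offending characters from each sentence before it reaches PTBTokenizer. The sketch below uses only the Java standard library (Java 9+ for Set.of). InputSanitizer and stripCodePoints are hypothetical names, not part of the Stanford API. I don't know exactly which characters PTBLexer rejects, so the filter is an assumption based on the log above: it drops non-whitespace control characters (which covers U+0089) plus any code points passed in explicitly (here U+0B3F).

```java
import java.util.Set;

// Hypothetical helper, not part of Stanford CoreNLP.
final class InputSanitizer {

    // Removes non-whitespace ISO control characters (U+0000-U+001F and
    // U+007F-U+009F, except tab/newline-style whitespace) and every code
    // point listed in extraCodePoints. Everything else is kept unchanged.
    static String stripCodePoints(String text, Set<Integer> extraCodePoints) {
        StringBuilder cleaned = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> Character.isWhitespace(cp) || !Character.isISOControl(cp))
            .filter(cp -> !extraCodePoints.contains(cp))
            .forEach(cleaned::appendCodePoint);
        return cleaned.toString();
    }

    public static void main(String[] args) {
        // Assumption: U+0B3F is listed only because it appears in the warnings.
        Set<Integer> extra = Set.of(0x0B3F);
        String sentence = "some\u0089 text \u0B3Fhere";
        System.out.println(stripCodePoints(sentence, extra));
    }
}
```

In the loop from the question, this would be applied as new StringReader(InputSanitizer.stripCodePoints(sentence, extra)). It does not explain why the noneDelete option appears to be ignored; it only keeps those characters away from the lexer.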