Question

I just ran into a strange exception from deep inside Stanford NLP while trying to tokenize:

java.lang.NullPointerException
    at edu.stanford.nlp.process.PTBLexer.zzRefill(PTBLexer.java:24511)
    at edu.stanford.nlp.process.PTBLexer.next(PTBLexer.java:24718)
    at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:276)
    at edu.stanford.nlp.process.PTBTokenizer.getNext(PTBTokenizer.java:163)
    at edu.stanford.nlp.process.AbstractTokenizer.hasNext(AbstractTokenizer.java:55)
    at edu.stanford.nlp.process.DocumentPreprocessor$PlainTextIterator.primeNext(DocumentPreprocessor.java:270)
    at edu.stanford.nlp.process.DocumentPreprocessor$PlainTextIterator.hasNext(DocumentPreprocessor.java:334)

The code that causes it looks like this:

    DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(tweet));

    // unigrams
    for (List<HasWord> sentence : dp) {
        for (HasWord word : sentence) {
            // do stuff
        }
    }

    // bigrams
    for (List<HasWord> sentence : dp) { //<< exception is thrown here
        Iterator<HasWord> it = sentence.iterator();
        String st1 = it.next().word();
        while (it.hasNext()) {
            String st2 = it.next().word();
            String bigram = st1 + " " + st2;
            // do stuff
            st1 = st2;
        }
    }

What is going on? Does this have to do with my looping over the tokens twice?

Answer 1

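That is certainly an ugly stack trace, which can and should be improved. (I am about to check in a fix for that.) But the reason this does not work is that a DocumentPreprocessor acts like a Reader: it only lets you make a single pass through the sentences of a document. After the first for loop, the document is exhausted and the underlying Reader has been closed. Hence the second for loop fails, and in this case it crashes deep inside the lexer. I am going to change it so that it simply gives you nothing. To get what you want, you should either (most efficiently) collect both the unigrams and the bigrams in a single for-loop pass through the document, or create a second DocumentPreprocessor for the second pass.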
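The single-pass option could be sketched roughly as follows. This is only an illustration, not code from Stanford NLP: the class NGramCollector and its collect method are hypothetical names, and the helper is written against a plain Iterable so that it does not depend on the library. It walks the sentences exactly once and builds unigrams and bigrams together, which is what a single-pass source requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of Stanford NLP.
final class NGramCollector {
    final List<String> unigrams = new ArrayList<>();
    final List<String> bigrams = new ArrayList<>();

    // Iterates over the sentences exactly once, so it also works for
    // sources that can only be traversed a single time.
    static <T> NGramCollector collect(Iterable<? extends List<T>> sentences,
                                      Function<? super T, String> toWord) {
        NGramCollector out = new NGramCollector();
        for (List<T> sentence : sentences) {
            String previous = null; // reset so bigrams never span two sentences
            for (T token : sentence) {
                String word = toWord.apply(token);
                out.unigrams.add(word);
                if (previous != null) {
                    out.bigrams.add(previous + " " + word);
                }
                previous = word;
            }
        }
        return out;
    }
}
```

Since DocumentPreprocessor is iterable over List<HasWord>, the call in the question's setting would presumably look like NGramCollector.collect(dp, HasWord::word). If two separate loops are preferred, the alternative is to construct a new DocumentPreprocessor over a fresh StringReader(tweet) before the second loop.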

Answer 2

I think it.next().word() is causing it.

Change your code so that it first checks it.hasNext() and only then calls it.next().word().
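A minimal sketch of that guard, assuming the words of one sentence have already been extracted into a List<String>. The class GuardedBigrams and its bigrams method are hypothetical names used only for illustration. Checking hasNext() before the first next() avoids a NoSuchElementException when a sentence is empty.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper for illustration; not part of Stanford NLP.
final class GuardedBigrams {
    static List<String> bigrams(List<String> sentence) {
        List<String> result = new ArrayList<>();
        Iterator<String> it = sentence.iterator();
        if (!it.hasNext()) {
            return result; // empty sentence: there is nothing to pair up
        }
        String first = it.next();
        while (it.hasNext()) {
            String second = it.next();
            result.add(first + " " + second);
            first = second; // slide the window forward by one word
        }
        return result;
    }
}
```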

Stanford Tokenizer NullPointerException
