Question

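In my application I tokenize relatively large documents into sentences. I have two different ways of splitting a document into sentences: one based on the Simple CoreNLP API and the other based on a naive regular expression.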

The CoreNLP-based approach:

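import edu.stanford.nlp.simple.Document;
import java.io.Serializable;
import java.util.List;
import java.util.stream.Collectors;

public class CoreNLPSentenceTokenizer implements ITokenizer, Serializable {

    static final long serialVersionUID = 1L;

    @Override
    public List<String> getTokens(String s) {
        // A new Simple API Document is created for every input string
        Document document = new Document(s);
        return document.sentences().stream().map(sent -> sent.text()).collect(Collectors.toList());
    }
}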

The naive regex-based approach:

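import java.util.Arrays;
import java.util.List;

public class SentenceTokenizer implements ITokenizer {

    @Override
    public List<String> getTokens(String content) {
        // Split on every '.', '?' or '!'; the delimiters themselves are dropped
        return Arrays.asList(
                content.split("(\\.|\\?|\\!)"));
    }
}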
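To illustrate the limits of the regex approach, here is a minimal, self-contained sketch that applies the same split pattern to text containing an abbreviation. The class name NaiveSplitDemo and the sample sentence are made up for illustration and are not part of the application above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of the original application.
class NaiveSplitDemo {

    // Same pattern as SentenceTokenizer: split on '.', '?' or '!'.
    static List<String> split(String content) {
        return Arrays.asList(content.split("(\\.|\\?|\\!)"));
    }

    public static void main(String[] args) {
        // The period after "Dr" is treated like any other sentence boundary.
        for (String part : split("Dr. Smith arrived. He left.")) {
            System.out.println("[" + part + "]");
        }
    }
}
```

The period in "Dr." counts as a boundary, so the input is cut into three fragments instead of two sentences, and the punctuation is lost from each fragment.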

After a while, when using the CoreNLP-based tokenizer, I get an OutOfMemoryError. I decided to attach VisualVM to my application to see what was going on, and this was the result:

There are large memory allocations for edu.stanford.nlp.pipeline.CoreNLPProtos$Token$Builder and edu.stanford.nlp.pipeline.CoreNLPProtos$Token.

I then replaced it with the naive tokenizer mentioned above (the only part of the code I changed) and got these results:

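This is more like what I expected, since I use the sentences to compute hashes for an index and discard them afterwards. The code based on this tokenizer has been running for about 18 hours without an out-of-memory exception, and the heap looks like this: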

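The solid area is the growing hash-based index (as expected), and the peaks are most likely the sentences and other temporary objects allocated for the hash computation.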

I don't want to give up on CoreNLP, because it gives better results than the regexp tokenizer.

Answer 1

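Could you try this without using the simple interface? That should get rid of the memory leak. Make sure your code builds the pipeline only once, when the tokenizer is constructed.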

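Also, Stanford CoreNLP 3.9.0 is currently in beta, and we have added some new syntax that makes the traditional pipeline interface easier to use. You can download 3.9.0 from our website or build it from GitHub.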

3.9.0 is still somewhat unstable, though, and we will be putting out a new release soon.

Start with this traditional code and see if your memory leak goes away:

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import java.util.*;
import java.util.stream.Collectors;

// build the pipeline outside of your tokenize method when you
// initialize your tokenizer
Properties props = new Properties();
props.setProperty("annotators", "tokenize");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// use the pipeline when you want to tokenize
Annotation annotationToTokenize = new Annotation(s);
pipeline.annotate(annotationToTokenize);
List<CoreLabel> tokens = annotationToTokenize.get(CoreAnnotations.TokensAnnotation.class);
return tokens.stream().map(token -> token.word()).collect(Collectors.toList());
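The snippet above returns word tokens, while the question is about sentences. The following is a minimal sketch of how the same pipeline approach might be adapted to sentence splitting, assuming the ssplit annotator is added after tokenize. The class name PipelineSentenceTokenizer is hypothetical; in the question's code it would also implement the asker's ITokenizer interface. It requires the CoreNLP jar on the classpath.

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical class name, used here for illustration only.
class PipelineSentenceTokenizer {

    // Built once per tokenizer instance and reused for every document.
    private final StanfordCoreNLP pipeline;

    PipelineSentenceTokenizer() {
        Properties props = new Properties();
        // ssplit groups the tokens produced by tokenize into sentences.
        props.setProperty("annotators", "tokenize,ssplit");
        this.pipeline = new StanfordCoreNLP(props);
    }

    List<String> getTokens(String s) {
        Annotation annotation = new Annotation(s);
        pipeline.annotate(annotation);
        List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
        // Each sentence carries its original text span under TextAnnotation.
        return sentences.stream()
                .map(sentence -> sentence.get(CoreAnnotations.TextAnnotation.class))
                .collect(Collectors.toList());
    }
}
```

The pipeline is created in the constructor, so calling getTokens repeatedly does not rebuild it. Whether this removes the memory growth seen in the question would still need to be checked with a profiler.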

Stanford CoreNLP memory leak
