Question

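Here is my situation.

I have a TextProcessor class that processes text. I need to find coreferences in such text and then extract information from it with Stanford's OpenIE tool. I use these two pipelines:

"tokenize,ssplit,pos,lemma,ner,parse,mention,coref" for coreferences

and

"tokenize,ssplit,pos,lemma,depparse,natlog,openie" for information extraction.

Running them separately makes analyzing a single text very slow, but for now I have to do it this way: running both together needs a lot of memory, and the pipeline exceeds my memory limit.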

import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TextProcessor {
    Properties props;
    StanfordCoreNLP pipeline;

    public TextProcessor() {
        props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
        pipeline = new StanfordCoreNLP(props);
    }

    // Performs NER and coreference resolution
    public void process(String text) {
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // Process text (tokenization, pos, lemma, ner, coref)....
    }

    // Builds a second pipeline for OpenIE and annotates the raw text from scratch
    public void extractInformation(String document) {
        props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
        pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation(document);
        pipeline.annotate(doc);

        // Extract information from doc ...
    }
}

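Is there a way to combine the two pipelines dynamically? I mean something like this:

1) "tokenize,ssplit,pos,lemma,ner,depparse,mention,coref"

2) "tokenize,ssplit,pos,lemma,ner,depparse,mention,coref,natlog,openie".

I tried returning an Annotation object from the first method, process(String text), and then adding the other three annotators to it in the method extractInformation(String text), like this: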

    public Annotation process(String text) {
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // Process text (tokenization, pos, lemma, ner, coref)....
        return document;
    }

    public void extractInformation(Annotation document) {
        // Only the annotators that have not been run yet
        props.setProperty("annotators", "depparse,natlog,openie");
        pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);

        // Extract information from doc ...
    }

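But I get this error:

annotator "depparse" requires annotation "TextAnnotation". The usual requirements for this annotator are: tokenize,ssplit,pos

I thought that adding the three new annotators (depparse, natlog, openie) to a document that had already been annotated (with tokenize, ssplit, pos) would work, but it did not.

So, is there a way to add these annotators to the existing pipeline, so that I avoid running the whole pipeline again (plus the new annotators) and keep memory within its limit?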
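To make "only the annotators that have not been run yet" concrete, here is a minimal sketch that diffs the two annotator lists from the question by name. AnnotatorDiff and its methods are hypothetical helpers, not part of CoreNLP, and the comparison is purely textual: it cannot tell whether an annotator that already ran (such as parse) provides what a differently named one (such as depparse) would add.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of CoreNLP: compares annotator lists by name only.
final class AnnotatorDiff {

    // Splits a comma-separated annotator string into trimmed names, keeping order.
    static List<String> parse(String annotators) {
        List<String> names = new ArrayList<>();
        for (String name : annotators.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    // Returns the annotators in `wanted` that do not appear in `alreadyRun`,
    // joined with commas in their original order.
    static String remaining(String alreadyRun, String wanted) {
        Set<String> done = new LinkedHashSet<>(parse(alreadyRun));
        List<String> rest = new ArrayList<>();
        for (String name : parse(wanted)) {
            if (!done.contains(name)) {
                rest.add(name);
            }
        }
        return String.join(",", rest);
    }
}
```

Called with the coreference list as alreadyRun and the OpenIE list as wanted, remaining returns "depparse,natlog,openie", which is the list tried in the second attempt above.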

Update

All I needed to do was:

    public Annotation process(String text) {
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // Process text (tokenization, pos, lemma, ner, coref)....
        StanfordCoreNLP.clearAnnotatorPool(); // <-- Added: to get rid of the models and solve the memory issue
        return document;
    }

    public void extractInformation(Annotation document) {
        props.setProperty("annotators", "natlog,openie");

        props.setProperty("enforceRequirements", "false"); // <-- Added

        pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(document);

        // Extract information from doc ...
    }

Alternatively, you can use:

pipeline = new StanfordCoreNLP(props, false);

in extractInformation(Annotation document).

Answer 1

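It sounds like you want to build the first pipeline, run it on a set of documents, clear out the memory, then build the second pipeline and run it on the same set of documents.

If you run the second pipeline on the same set of Annotations, it will simply pick up where the first pipeline left off. But you need to set enforceRequirements to false so that the second pipeline does not crash. And once you are done with the first pipeline, you should run StanfordCoreNLP.clearAnnotatorPool(); to get rid of the models, otherwise you will not solve the memory issue.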
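The two stages can be sketched as plain configuration. StagedPipelineConfig is a hypothetical helper, not part of CoreNLP. It only builds the Properties for each stage, using the annotator lists and the enforceRequirements setting from the question. The CoreNLP calls appear only in comments, taken from the question's update, because they need the library and its models on the classpath.

```java
import java.util.Properties;

// Hypothetical helper, not part of CoreNLP: one Properties object per stage.
final class StagedPipelineConfig {

    // Stage 1: the coreference annotators listed in the question.
    static Properties corefStage() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,mention,coref");
        return props;
    }

    // Stage 2: only the annotators that stage 1 has not run.
    static Properties openieStage() {
        Properties props = new Properties();
        props.setProperty("annotators", "natlog,openie");
        // Skip the prerequisite check: the Annotation passed in is assumed
        // to already carry the output of stage 1.
        props.setProperty("enforceRequirements", "false");
        return props;
    }

    // Intended order of use with CoreNLP available (not executed here):
    //   StanfordCoreNLP first = new StanfordCoreNLP(corefStage());
    //   first.annotate(document);
    //   StanfordCoreNLP.clearAnnotatorPool();  // release the stage 1 models
    //   StanfordCoreNLP second = new StanfordCoreNLP(openieStage());
    //   second.annotate(document);             // same Annotation object
}
```

Keeping each stage in its own fresh Properties object avoids carrying settings from one stage into the other, which can happen when a shared props field is mutated in place.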

Dynamically add properties to a StanfordCoreNLP Annotator or Pipeline
