Question

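How should I do this?

What I want to do is load Stanford NLP ONCE, then interact with it over HTTP or some other endpoint. The reason is that it takes a long time to load, and loading it again for every string to be analyzed is out of the question.

For example, here is Stanford NLP loading in a simple C# program that loads the jars... I want to do what I'm doing below, but in Java: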

    Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [9.3 sec]. 
    Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.all.3class.distsim.crf.ser.gz ... done [12.8 sec]. 
    Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.muc.7class.distsim.crf.ser.gz ... done [5.9 sec]. 
    Loading classifier from D:\Repositories\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models\edu\stanford\nlp\models\ner\english.conll.4class.distsim.crf.ser.gz ...  done [4.1 sec]. 
    done [8.8 sec].

Sentence #1 ...

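That is over 30 seconds. If all of this had to load every time, yikes. To show what I want to do in Java, I wrote a working example in C#; this complete example may help someone someday: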

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

using System.IO;
using java.io;
using java.util;
using edu.stanford.nlp;
using edu.stanford.nlp.pipeline;
using Console = System.Console;

namespace NLPConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            // Path to the folder with models extracted from `stanford-corenlp-3.6.0-models.jar`
            var jarRoot = @"..\..\..\..\StanfordNLPCoreNLP\stanford-corenlp-3.6.0-models";
            // Text for the initial processing run
            var text = "Kosgi Santosh sent an email to Stanford University. He didn't get a reply.";
            // Annotation pipeline configuration
            var props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, sentiment"); 
            props.setProperty("ner.useSUTime", "0");
            // We should change current directory, so StanfordCoreNLP could find all the model files automatically
            var curDir = Environment.CurrentDirectory;
            Directory.SetCurrentDirectory(jarRoot);
            var pipeline = new StanfordCoreNLP(props);
            Directory.SetCurrentDirectory(curDir);
            // loop
            while (text != "quit")
            {
                // Annotation
                var annotation = new Annotation(text);
                pipeline.annotate(annotation);
                // Result - Pretty Print
                using (var stream = new ByteArrayOutputStream())
                {
                    pipeline.prettyPrint(annotation, new PrintWriter(stream));
                    Console.WriteLine(stream.toString());
                    stream.close();
                }
                edu.stanford.nlp.trees.TreePrint tprint = new edu.stanford.nlp.trees.TreePrint("words");
                Console.WriteLine();
                Console.WriteLine("Enter a sentence to evaluate, and hit ENTER (enter \"quit\" to quit)");
                text = Console.ReadLine();
            } // end while
        }
    }
}

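So loading takes 30 seconds, but each time you give it a string on the console, it takes only a fraction of a second to parse and tokenize that string.

You can see that I load the jar files before the while loop.

This may eventually become a socket service, HTML, or something else that accepts requests (as strings) and spits out the parse.

My ultimate goal is to use a mechanism in NiFi, via a processor that can send strings to be parsed and get them back in under a second, as opposed to the 30+ seconds it would take with a traditional web server threaded example (for instance), where every request would load the whole thing for 30 seconds and THEN get down to business. I hope I made this clear!

How do I do this?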

Answer 1

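Any of the mechanisms you list is a perfectly reasonable route for making use of that service from Apache NiFi. Depending on your needs, some of the processors and extensions bundled with the standard release of NiFi may be sufficient to interact with your proposed web service or a similar offering.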
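
As a rough illustration of the "load once, then serve over HTTP" route, here is a minimal sketch using only the JDK's built-in `com.sun.net.httpserver.HttpServer`. The class name `LoadOnceService`, the `/annotate` path, port 8080, and the word-counting stand-in for the model are all assumptions made for this example; in a real service `loadExpensiveModel()` would be where the CoreNLP pipeline is constructed.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class LoadOnceService {

    /**
     * Hypothetical stand-in for the slow-loading pipeline.
     * A real service would build the CoreNLP pipeline here, exactly once.
     */
    static Function<String, String> loadExpensiveModel() {
        // Placeholder "analysis": count whitespace-separated words.
        return text -> "words=" + (text.isBlank() ? 0 : text.trim().split("\\s+").length);
    }

    /** Starts an HTTP server that reuses the already-loaded model for every request. */
    static HttpServer start(int port, Function<String, String> model) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
        server.createContext("/annotate", exchange -> {
            // The request body is the raw text to analyze.
            byte[] body = exchange.getRequestBody().readAllBytes();
            String text = new String(body, StandardCharsets.UTF_8);
            byte[] reply = model.apply(text).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, reply.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(reply);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        // The load cost is paid once at startup, not once per request.
        Function<String, String> model = loadExpensiveModel();
        start(8080, model);
        System.out.println("Listening on http://localhost:8080/annotate");
    }
}
```

With the server running, a NiFi processor that can issue HTTP requests (or any HTTP client) could POST a string to `/annotate` and read the result from the response body.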

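If you are striving to perform all of this within NiFi itself, a custom Controller Service could be a great path for providing this resource to NiFi, with the resource tied to the lifecycle of the application itself.

NiFi can be extended with items such as controller services and custom processors, and we have some documentation to get you started down that path.

Additional details would certainly help in providing more information. Feel free to follow up with further comments and/or reach out to the community via our mailing lists.

One item I did want to call out, in case it was not clear, is that NiFi is JVM driven and the work would be done in Java or a JVM-friendly language.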
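
To make the Controller Service idea a little more concrete, here is a plain-Java sketch of the lifecycle it implies: the expensive resource is loaded when the service is enabled, shared by every caller, and released when the service is disabled. It deliberately does not depend on the NiFi API. The class `SharedNlpService` and its methods are hypothetical; in an actual NiFi extension the two lifecycle methods would typically be annotated with `@OnEnabled` and `@OnDisabled` on a class extending `AbstractControllerService`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class SharedNlpService {

    // Counts how many times the expensive load actually ran (for illustration only).
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    private volatile Function<String, String> pipeline;

    /** Hypothetical stand-in for the 30-second model load. */
    private static Function<String, String> loadPipeline() {
        LOAD_COUNT.incrementAndGet();
        return text -> text.toUpperCase(); // placeholder "annotation"
    }

    /** Would correspond to an @OnEnabled method in a NiFi Controller Service. */
    public synchronized void onEnabled() {
        if (pipeline == null) {
            pipeline = loadPipeline();
        }
    }

    /** Would correspond to an @OnDisabled method: drop the reference so it can be collected. */
    public synchronized void onDisabled() {
        pipeline = null;
    }

    /** Called by any number of processors; never triggers a reload. */
    public String annotate(String text) {
        Function<String, String> p = pipeline;
        if (p == null) {
            throw new IllegalStateException("service is not enabled");
        }
        return p.apply(text);
    }

    public static void main(String[] args) {
        SharedNlpService service = new SharedNlpService();
        service.onEnabled();
        System.out.println(service.annotate("first request"));
        System.out.println(service.annotate("second request"));
        System.out.println("loads: " + LOAD_COUNT.get());
        service.onDisabled();
    }
}
```

The point of the design is that processors only ever call `annotate`, so the load cost is tied to enabling the service rather than to individual requests.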

Answer 2

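You should look at the new CoreNLP Server, which Stanford NLP introduced in version 3.6.0. It seems to do just what you want? Some other people, such as ETS, have done similar things.

Fine point: if you use it heavily, you may (currently) want to grab the latest CoreNLP code from GitHub HEAD, since it contains a few fixes to the server that will be in the next release.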
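
For what it's worth, here is a minimal sketch of calling such a server from Java with the JDK's `java.net.http.HttpClient` (Java 11+). It assumes the server was started separately, for example with `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000`, and is reachable at `http://localhost:9000`. The class and method names are made up for this example, and the explicit `Content-Type` header is a guess at what keeps the body from being treated as URL-encoded; check the server documentation for your version.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class CoreNlpServerClient {

    /** Builds the request URI; annotators are passed as a URL-encoded JSON "properties" parameter. */
    static URI buildUri(String baseUrl, String annotators) {
        String props = "{\"annotators\":\"" + annotators + "\",\"outputFormat\":\"json\"}";
        return URI.create(baseUrl + "/?properties=" + URLEncoder.encode(props, StandardCharsets.UTF_8));
    }

    /** POSTs the raw text to the server and returns the response body (JSON) as a string. */
    static String annotate(HttpClient client, String baseUrl, String text) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(buildUri(baseUrl, "tokenize,ssplit,pos,lemma,ner"))
                // Assumption: plain text avoids the body being URL-decoded by the server.
                .header("Content-Type", "text/plain; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Requires a CoreNLP server already running on localhost:9000.
        HttpClient client = HttpClient.newHttpClient();
        String json = annotate(client, "http://localhost:9000",
                "Kosgi Santosh sent an email to Stanford University.");
        System.out.println(json);
    }
}
```

Because the models stay loaded inside the server process, each call only pays for the annotation itself, not for the 30-second startup.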

Interacting with a large Java program as a service?