Stanford Parser - Multithreading issue - LexicalizedParser

Time: 2017-01-01 12:33:52

Tags: java multithreading nlp stanford-nlp

First, parsing runs smoothly on a small set of sentences, on the order of 200 ms to 1 s per sentence, depending on sentence length.

What am I trying to achieve?

I want to parse 50L (5 million) sentences in 1-2 hours.
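As a rough sanity check on that target, the arithmetic can be sketched as follows. This assumes "50L" means 50 lakh (5 million) sentences, takes the 200 ms and 1 s per-sentence timings quoted above at face value, and pretends throughput scales linearly with the number of workers, which real hardware will not quite deliver. The class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate; every figure is an assumption taken from the question.
final class ParseBudget {
    /** Hours needed to parse `sentences` at `secondsPerSentence` each on `workers` parallel workers. */
    static double hoursNeeded(long sentences, double secondsPerSentence, int workers) {
        return sentences * secondsPerSentence / workers / 3600.0;
    }

    /** Smallest number of parallel workers that finishes within `budgetHours`. */
    static int workersNeeded(long sentences, double secondsPerSentence, double budgetHours) {
        return (int) Math.ceil(sentences * secondsPerSentence / (budgetHours * 3600.0));
    }

    public static void main(String[] args) {
        long sentences = 5_000_000L; // assumption: "50L" = 50 lakh
        System.out.printf("1 worker at 200 ms: %.0f hours%n", hoursNeeded(sentences, 0.2, 1));
        System.out.printf("workers for 2 h at 200 ms: %d%n", workersNeeded(sentences, 0.2, 2.0));
        System.out.printf("workers for 2 h at 1 s: %d%n", workersNeeded(sentences, 1.0, 2.0));
    }
}
```

Under those assumptions a single worker would need roughly 278 hours at 200 ms per sentence, so a two-hour budget implies on the order of 139 parallel workers (about 695 at 1 s per sentence), far more than two threads.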

Somehow, I need to convert this ->

    for (String sentence : sentences) {
        Tree parsed = AnalysisUtilities.getInstance().parseSentence(sentence).parse;
    }

into a multithreaded call. I wrote a multithreaded executor to do this, which looks like this ->

    MultiThreadExecutor<String> mte = new MultiThreadExecutor<String>(2, new JobExecutor<String>() {
        @Override
        public void executeJob(String job) {
            Tree parsed = AnalysisUtilities.getInstance().parseSentence(job).parse;
            inputTrees.add(parsed);
        }
    }, "");

    for (String sentence : sentences) {
        mte.addJob(sentence);
    }
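For comparison, the same fan-out can be sketched with the JDK's own `ExecutorService` instead of a custom executor. This is only an illustration, not the `MultiThreadExecutor` used above: `ParallelParse`, `parseAll`, and `parseFn` are hypothetical names, and `parseFn` stands in for the real parse call. Results are gathered on the calling thread through `Future`s, so the workers never touch a shared list (the type of `inputTrees` is not shown above; if it is a plain `ArrayList`, concurrent `add` calls on it would be unsafe as well).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelParse {
    /**
     * Applies parseFn to every sentence on a fixed pool of nThreads workers and
     * returns the results in input order. parseFn must itself be safe to call
     * from several threads at once; this method does not make it so.
     */
    static <T> List<T> parseAll(List<String> sentences, Function<String, T> parseFn, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<T>> futures = new ArrayList<>(sentences.size());
            for (String sentence : sentences) {
                Callable<T> task = () -> parseFn.apply(sentence);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>(futures.size());
            for (Future<T> future : futures) {
                try {
                    results.add(future.get()); // blocks until this sentence is done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a parse", e);
                } catch (ExecutionException e) {
                    // Surface the worker's own exception instead of losing it in the pool.
                    throw new IllegalStateException("parse failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in "parser": wraps the sentence in a bracketed label.
        List<String> trees = parseAll(Arrays.asList("a b", "c d e"), s -> "(ROOT " + s + ")", 2);
        System.out.println(trees);
    }
}
```

Collecting through futures also means an exception thrown inside a worker reaches the caller, rather than only appearing in the worker's log output.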

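It works fine on a single thread, but as soon as I use more than one thread, exceptions are thrown inside the Stanford parse function. The exceptions look like this ->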

  

    java.lang.ArrayIndexOutOfBoundsException: 3
        at java.util.ArrayList.add(ArrayList.java:441)
        at edu.stanford.nlp.parser.lexparser.BaseLexicon.initRulesWithWord(BaseLexicon.java:300)
        at edu.stanford.nlp.parser.lexparser.BaseLexicon.isKnown(BaseLexicon.java:160)
        at edu.stanford.nlp.parser.lexparser.BaseLexicon.ruleIteratorByWord(BaseLexicon.java:212)
        at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.initializeChart(ExhaustivePCFGParser.java:1299)
        at edu.stanford.nlp.parser.lexparser.ExhaustivePCFGParser.parse(ExhaustivePCFGParser.java:388)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:234)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:189)
        at edu.cmu.ark.AnalysisUtilities.parseSentence(AnalysisUtilities.java:262)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:147)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:144)
        at edu.cmu.ark.MultiThreadExecutor$1.run(MultiThreadExecutor.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

    java.lang.RuntimeException: dependencies not equal: "Spacious/CD" -> ".*./CC" left 0 and "Spacious/CD" -> "easily/RB" right 1
        at edu.stanford.nlp.parser.lexparser.MLEDependencyGrammar.probTB(MLEDependencyGrammar.java:586)
        at edu.stanford.nlp.parser.lexparser.MLEDependencyGrammar.scoreTB(MLEDependencyGrammar.java:511)
        at edu.stanford.nlp.parser.lexparser.AbstractDependencyGrammar.scoreTB(AbstractDependencyGrammar.java:229)
        at edu.stanford.nlp.parser.lexparser.ExhaustiveDependencyParser.parse(ExhaustiveDependencyParser.java:322)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:244)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:189)
        at edu.cmu.ark.AnalysisUtilities.parseSentence(AnalysisUtilities.java:262)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:147)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:144)
        at edu.cmu.ark.MultiThreadExecutor$1.run(MultiThreadExecutor.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

    java.lang.NullPointerException
        at edu.stanford.nlp.parser.lexparser.BiLexPCFGParser.projectHooks(BiLexPCFGParser.java:342)
        at edu.stanford.nlp.parser.lexparser.BiLexPCFGParser.processEdge(BiLexPCFGParser.java:546)
        at edu.stanford.nlp.parser.lexparser.BiLexPCFGParser.processItem(BiLexPCFGParser.java:571)
        at edu.stanford.nlp.parser.lexparser.BiLexPCFGParser.parse(BiLexPCFGParser.java:854)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:255)
        at edu.stanford.nlp.parser.lexparser.LexicalizedParser.parse(LexicalizedParser.java:189)
        at edu.cmu.ark.AnalysisUtilities.parseSentence(AnalysisUtilities.java:262)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:147)
        at edu.cmu.ark.QuestionAsker$1.executeJob(QuestionAsker.java:144)
        at edu.cmu.ark.MultiThreadExecutor$1.run(MultiThreadExecutor.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Is there any way to fix this? I looked at a previously asked question, but it did not help.
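One workaround sometimes used when a library object is not safe to share is to give every worker thread its own instance. Whether that helps here is an assumption: it only works if the state being corrupted in the traces above lives in the parser instance rather than in static fields, and each extra instance costs the memory of a loaded grammar. The sketch below shows just the pattern, using `ThreadLocal` from the JDK; `PerThreadParser` is a hypothetical helper, and `StringBuilder` merely stands in for a parser object.

```java
import java.util.function.Supplier;

/** Hands each thread its own lazily created instance of P (hypothetical helper). */
final class PerThreadParser<P> {
    private final ThreadLocal<P> perThread;

    PerThreadParser(Supplier<? extends P> factory) {
        // The factory runs at most once per thread, on that thread's first get().
        this.perThread = ThreadLocal.withInitial(factory);
    }

    /** Returns the calling thread's private instance. */
    P get() {
        return perThread.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // StringBuilder stands in for a parser that must not be shared across threads.
        PerThreadParser<StringBuilder> parsers = new PerThreadParser<StringBuilder>(StringBuilder::new);
        Runnable job = () -> {
            StringBuilder mine = parsers.get(); // never visible to the other worker
            mine.append("used by ").append(Thread.currentThread().getName());
            System.out.println(mine);
        };
        Thread a = new Thread(job, "worker-a");
        Thread b = new Thread(job, "worker-b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

In the executor above, the equivalent change would be to fetch the parser through such a per-thread holder inside `executeJob` instead of through the shared `AnalysisUtilities.getInstance()`, assuming the parser can be constructed more than once.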

1 Answer:

Answer 0: (score: 1)

Here is an example command that runs the parser in multithreaded mode:

java -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse -parse.nthreads 4 -ssplit.eolonly -file some-sentences.txt -outputFormat text
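The command reads its input from `some-sentences.txt`, and with `-ssplit.eolonly` each line of that file is treated as one sentence. A minimal sketch for producing such a file from an in-memory list of sentences might look like this; `SentenceFileWriter` is a hypothetical helper that uses only the JDK, and collapsing embedded line breaks is an added precaution, not something the command requires.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SentenceFileWriter {
    /**
     * Writes one sentence per line in UTF-8. Line breaks inside a sentence are
     * collapsed to a single space so that a sentence never spans two lines.
     */
    static Path write(List<String> sentences, Path target) {
        List<String> lines = new ArrayList<>(sentences.size());
        for (String sentence : sentences) {
            lines.add(sentence.replaceAll("\\s*[\\r\\n]+\\s*", " ").trim());
        }
        try {
            return Files.write(target, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("The room is spacious.", "It is easy\nto reach.");
        write(sentences, Paths.get("some-sentences.txt"));
    }
}
```

With the file in place, the command above can be run against it unchanged; the `-parse.nthreads 4` value would typically be adjusted to the number of available cores.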