Problem training the Stanford neural network parser

Time: 2015-05-06 12:12:43

Tags: java parsing neural-network stanford-nlp

I am currently trying to train a "recursive neural network parser" by following the steps shown at http://nlp.stanford.edu/software/parser-faq.shtml#rnn

Here is my setup:

  • OS: Windows 7
  • Tool: custom Stanford Parser (full, 2015-04-20)
  • Data: Stanford University's Wikipedia page
  • Java: 1.8

The parser is custom because I modified DVParserCostAndGradient.java to remove the use of MulticoreWrapper at line 273 (execution got stuck at that point):

MulticoreWrapper<Tree, Pair<DeepTree, DeepTree>> wrapper = new MulticoreWrapper<Tree, Pair<DeepTree, DeepTree>>(op.trainOptions.trainingThreads, new ScoringProcessor());
for (Tree tree : trainingBatch) {
    wrapper.put(tree);
}
wrapper.join();
scoreTiming.done();
while (wrapper.peek()) {
    Pair<DeepTree, DeepTree> result = wrapper.poll();
    [...]

which gives the following code:

ScoringProcessor scorer = new ScoringProcessor();
for (Tree tree : trainingBatch) {
    Pair<DeepTree, DeepTree> result = scorer.process(tree);
    [...]
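
The rewrite above trades the thread pool for a plain loop. As a rough, JDK-only illustration of that trade (these are not the Stanford classes: `SequentialVsPooled`, `processSequentially`, and `processPooled` are hypothetical names, and a `Function` stands in for `ScoringProcessor`), the sketch below puts the sequential form next to a pooled form whose `Future.get` carries a timeout, so a stall would surface as an exception rather than a silent hang. It assumes the processor is stateless.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

public class SequentialVsPooled {

    // Sequential variant: same shape as the modified loop, one item at a time.
    static <I, O> List<O> processSequentially(List<I> items, Function<I, O> processor) {
        List<O> results = new ArrayList<>();
        for (I item : items) {
            results.add(processor.apply(item));
        }
        return results;
    }

    // Pooled variant: results come back in submission order, and each wait is
    // bounded by a timeout so a stuck task raises an exception.
    static <I, O> List<O> processPooled(List<I> items, Function<I, O> processor,
                                        int threads, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<O>> futures = new ArrayList<>();
            for (I item : items) {
                Callable<O> task = () -> processor.apply(item);
                futures.add(pool.submit(task));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : futures) {
                results.add(future.get(timeoutSeconds, TimeUnit.SECONDS));
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a task", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("task failed or timed out", e);
        } finally {
            pool.shutdownNow(); // always release the worker threads
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a scorer: maps each string to its length.
        List<String> batch = Arrays.asList("a", "bb", "ccc");
        Function<String, Integer> scorer = String::length;
        System.out.println(processSequentially(batch, scorer));
        System.out.println(processPooled(batch, scorer, 2, 5));
    }
}
```

If both variants return the same list for a given batch, the sequential loop is a behavior-preserving replacement for that processor; the only thing given up is parallelism.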

I issued the following commands:

/ 1 /

java -mx2g -cp "ejml-0.23.jar;stanford-parser-custom.jar" edu.stanford.nlp.parser.lexparser.LexicalizedParser .\models\englishPCFG.ser.gz .\plain\1.txt > .\treebank\1.txt

/ 2 /

java -mx2g -cp "stanford-parser-custom.jar;stanford-parser-3.5.2-models.jar" edu.stanford.nlp.parser.dvparser.CacheParseHypotheses -model models/englishPCFG.ser.gz -treebank treebank 1 -output cached.wsj.ser.gz -numThreads 2 2> log.txt

/ 3 /

java -mx2g -cp "ejml-0.23.jar;stanford-parser-custom.jar" edu.stanford.nlp.parser.dvparser.DVParser -cachedTrees cached.wsj.ser.gz -train -testTreebank treebank 1 -debugOutputFrequency 500 -trainingThreads 1 -parser models/englishPCFG.ser.gz -dvIterations 40 -dvBatchSize 25 -wordVectorFile t2.txt -model models/RNNtry.ser.gz -unkword "-UNKNOWN-"

where t2.txt is the embedding found in "stanford-parser-3.5.2-models.jar\edu\stanford\nlp\models\parser\nndep\wsj_SD.gz".
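
Since command /3/ passes both a word vector file and an unknown-word token, one thing that may be worth ruling out is the layout of t2.txt itself. The sketch below is only a sanity check under a stated assumption: that the file is plain text with one word per line followed by its whitespace-separated vector components. Whether t2.txt actually follows that layout is exactly what the check would reveal. `WordVectorCheck` and `summarize` are hypothetical names, not part of the Stanford code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class WordVectorCheck {

    /** Plain holder for what the scan found. */
    static final class Summary {
        int wellFormed;      // lines matching the dimension of the first vector
        int malformed;       // lines with a different length or a non-numeric component
        int dimension = -1;  // vector length taken from the first non-blank line
        boolean hasUnknown;  // true if the unknown-word token has a well-formed line
    }

    // Assumed layout (not confirmed by the source): "word v1 v2 ... vn" per line.
    static Summary summarize(Iterable<String> lines, String unkWord) {
        Summary summary = new Summary();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            String[] fields = line.split("\\s+");
            if (summary.dimension < 0) summary.dimension = fields.length - 1;
            if (fields.length - 1 != summary.dimension || !allNumeric(fields)) {
                summary.malformed++;
                continue;
            }
            summary.wellFormed++;
            if (fields[0].equals(unkWord)) summary.hasUnknown = true;
        }
        return summary;
    }

    // Every field after the word itself must parse as a floating-point number.
    private static boolean allNumeric(String[] fields) {
        for (int i = 1; i < fields.length; i++) {
            try {
                Double.parseDouble(fields[i]);
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java WordVectorCheck <vector-file> [unk-word]");
            return;
        }
        String unkWord = args.length > 1 ? args[1] : "-UNKNOWN-";
        try (Stream<String> lines = Files.lines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            Summary s = summarize(lines::iterator, unkWord);
            System.out.printf("dimension=%d wellFormed=%d malformed=%d hasUnknown=%b%n",
                    s.dimension, s.wellFormed, s.malformed, s.hasUnknown);
        }
    }
}
```

A large `malformed` count, or `hasUnknown=false` for "-UNKNOWN-", would suggest the file is not in the assumed layout; it would not by itself explain the filtering in step /2/, which does not read this file.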

The first problem I ran into occurs when executing /2/: in log.txt, I see the following for every sentence:

WARNING: filtered all trees for

followed by what I believe is a parse tree.

When launching /3/, I get:

Iteration 0 batch 0
Converting trees ... done [0,1 sec].
Scoring trees ... Exception in thread "main" java.lang.AssertionError: Failed to get any hypothesis trees for

followed, most likely, by the same tree as in /2/.

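After that, the process appears to hang. The error in /3/ is almost certainly a consequence of the one in /2/. Does anyone know why every tree is filtered out in step /2/?

Thanks in advance,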

Edit:

The text given to the LexicalizedParser module is the unsegmented text of the page, stripped of all HTML code:

Origins and early years (1885–1906)
The university officially opened on October 1, 1891 to 555 students. On the university's opening day, Founding President David Starr Jordan (1851–1931) said to Stanford's Pioneer Class: " is hallowed by no traditions; it is hampered by none. Its finger posts all point forward." However, much preceded the opening and continued for several years until the death of the last Founder, Jane Stanford, in 1905 and the destruction of the 1906 earthquake.

LexicalizedParser output:

(ROOT
  (S
    (NP
      (NP
        (NP (NNS Origins))
        (CC and)
        (NP (JJ early) (NNS years)))
      (PRN (-LRB- -LRB-)
        (NP
          (NP (NNP 1885))
          (: --)
          (NP (CD 1906)))
        (-RRB- -RRB-)))
    (ADVP
      (NP (DT The) (NN university))
      (RB officially))
    (VP (VBD opened)
      (PP (IN on)
        (NP (NNP October) (CD 1) (, ,) (CD 1891)))
      (PP (TO to)
        (NP (CD 555) (NNS students))))
    (. .)))

[...]
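
Because step /2/ reads whatever step /1/ wrote into the treebank directory, it can help to confirm that the file holds only complete bracketed trees. The following is a minimal sketch of such a check, not a diagnosis of the filtering itself. It counts raw parentheses, which relies on literal brackets in the text being escaped as -LRB- and -RRB-, as they are in the output above. `TreebankSanityCheck` and `countTopLevelTrees` are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TreebankSanityCheck {

    /**
     * Returns the number of complete top-level bracketed trees in the text,
     * or -1 if the parentheses are unbalanced anywhere.
     */
    static int countTopLevelTrees(String text) {
        int depth = 0;
        int trees = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) return -1;   // closing bracket with no opener
                if (depth == 0) trees++;    // a top-level tree just closed
            }
        }
        return depth == 0 ? trees : -1;     // leftover openers mean a truncated tree
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java TreebankSanityCheck <treebank-file>");
            return;
        }
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println("top-level trees: " + countTopLevelTrees(text));
    }
}
```

Running it on .\treebank\1.txt should print one tree per parsed sentence; a result of -1 would point at a truncated or polluted file rather than at the parser.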

I can add that the tree in the stack trace appears to be fully annotated; here is the warning given in step /2/ for this sentence:

WARNING: filtered all trees for (ROOT (S^ROOT-v (@S^ROOT-v| VP^S-VBF-v_ ... ADVP^S< NP^S[ (NP^S (@NP^S| NP^NP-R_ PRN^N
P] (@NP^S| NP^NP-R_ (NP^NP-R (@NP^NP-R| NP^NP-B_ CC^NP-C> NP^NP-B] (@NP^NP-R| NP^NP-B_ CC^NP-C> (@NP^NP-R| NP^NP-B_ (NP
^NP-B (NNS^NP Origins))) (CC^NP-C and)) (NP^NP-B (@NP^NP-B| NNS^NP_ JJ^NP[ (JJ^NP early) (@NP^NP-B| NNS^NP_ (NNS^NP yea
rs))))))) (PRN^NP (@PRN^NP| NP^PRN-R_ -RRB-^PRN> -LRB-^PRN[ (-LRB-^PRN -LRB-) (@PRN^NP| NP^PRN-R_ -RRB-^PRN> (@PRN^NP| 
NP^PRN-R_ (NP^PRN-R (@NP^PRN-R| NP^NP-B_ :^NP> NP^NP-B] (@NP^PRN-R| NP^NP-B_ :^NP> (@NP^PRN-R| NP^NP-B_ (NP^NP-B (NNP^N
P 1885))) (:^NP --)) (NP^NP-B (CD^NP 1906))))) (-RRB-^PRN -RRB-)))))) (@S^ROOT-v| VP^S-VBF-v_ .^S> ADVP^S< (ADVP^S (@AD
VP^S| RB^ADVP_ NP^ADVP-B[ (NP^ADVP-B (@NP^ADVP-B| NN^NP_ DT^NP[ (DT^NP The) (@NP^ADVP-B| NN^NP_ (NN^NP university)))) (
@ADVP^S| RB^ADVP_ (RB^ADVP officially)))) (@S^ROOT-v| VP^S-VBF-v_ .^S> (@S^ROOT-v| VP^S-VBF-v_ (VP^S-VBF-v (@VP^S-VBF-v
| VBD^VP_ PP^VP> PP^VP] (@VP^S-VBF-v| VBD^VP_ PP^VP> (@VP^S-VBF-v| VBD^VP_ (VBD^VP opened)) (PP^VP (@PP^VP| IN^PP_ NP^P
P-B] (@PP^VP| IN^PP_ (IN^PP on)) (NP^PP-B (@NP^PP-B| NNP^NP_ ... ,^NP> CD^NP] (@NP^PP-B| NNP^NP_ CD^NP> ,^NP> (@NP^PP-B
| NNP^NP_ CD^NP> (@NP^PP-B| NNP^NP_ (NNP^NP October)) (CD^NP 1)) (,^NP ,)) (CD^NP 1891)))))) (PP^VP (@PP^VP| TO^PP_ NP^
PP-B] (@PP^VP| TO^PP_ (TO^PP to)) (NP^PP-B (@NP^PP-B| NNS^NP_ CD^NP[ (CD^NP 555) (@NP^PP-B| NNS^NP_ (NNS^NP students)))
)))))) (.^S .))))) (.$$. .$.))

0 answers:

No answers