How do I train a new parser model for Stanford NLP from a treebank?

Asked: 2016-03-03 01:58:20

Tags: java parsing nlp stanford-nlp linguistics

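I have downloaded the UPDT Persian treebank (Uppsala Persian Dependency Treebank) and I am trying to build a dependency parser model from it with Stanford NLP. I have tried training the model both from the command line and from Java code, but I get an exception in both cases.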

1. Training the model from the command line:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train UPDT\train.conll 0 -saveToSerializedFile UPDT\updt.model.ser.gz

When I run the command above, I get this exception:

done [read 26 trees]. Time elapsed: 0 ms
Options parameters:
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType false
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams
forceCNF false
doPCFG true
doDep true
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags true
nPrune false
Train parameters:
 smooth=false
 PA=true
 GPA=false
 selSplit=false
 (0.0)
 mUnary=0
 mUnaryTags=false
 sPPT=false
 tagPA=false
 tagSelSplit=false (0.0)
 rightRec=false
 leftRec=false
 collinsPunc=false
 markov=false
 mOrd=1
 hSelSplit=false (10)
 compactGrammar=0
 postPA=false
 postGPA=false
 selPSplit=false (0.0)
 tagSelPSplit=false (0.0)
 postSplitWithBase=false
 fractionBeforeUnseenCounting=0.5
 openClassTypesThreshold=50
 preTransformer=null
 taggedFiles=null
 predictSplits=false
 splitCount=1
 splitRecombineRate=0.0
 simpleBinarizedLabels=false
 noRebinarization=false
 trainingThreads=1
 dvKBest=100
 trainingIterations=40
 batchSize=25
 regCost=1.0E-4
 qnIterationsPerBatch=1
 qnEstimates=15
 qnTolerance=15.0
 debugOutputFrequency=0
 randomSeed=0
 learningRate=0.1
 deltaMargin=0.1
 unknownNumberVector=true
 unknownDashedWordVectors=true
 unknownCapsVector=true
 unknownChineseYearVector=true
 unknownChineseNumberVector=true
 unknownChinesePercentVector=true
 dvSimplifiedModel=false
 scalingForInit=0.5
 maxTrainTimeSeconds=0
 unkWord=*UNK*
 lowercaseWordVectors=false
 transformMatrixType=DIAGONAL
 useContextWords=false
 trainWordVectors=true
 stalledIterationLimit=12
 markStrahler=false

Using EnglishTreebankParserParams splitIN=0 sPercent=false sNNP=0 sQuotes=false sSFP=false rbGPA=false j#=false jJJ=false jNounTags=false sPPJJ=false sTRJJ=false sJJCOMP=false sMoreLess=false unaryDT=false unaryRB=false unaryPRP=false reflPRP=false unaryIN=false sCC=0 sNT=false sRB=false sAux=0 vpSubCat=false mDTV=0 sVP=0 sVPNPAgr=false sSTag=0 mVP=false sNP%=0 sNPPRP=false dominatesV=0 dominatesI=false dominatesC=false mCC=0 sSGapped=0 numNP=false sPoss=0 baseNP=0 sNPNNP=0 sTMP=0 sNPADV=0 cTags=false rightPhrasal=false gpaRootVP=false splitSbar=0 mPPTOiIN=0 cWh=0
Binarizing trees...done. Time elapsed: 12 ms
Extracting PCFG...PennTreeReader: warning: file has extra non-matching right parenthesis [ignored]
Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for _ using class edu.stanford.nlp.trees.ModCollinsHeadFinder in (_
  DELM
  DELM
  DELM
  13
  punct
  _
  _
  15
  ??????
  _
  N
  N_SING
  SING
  13
  appos
  _
  _
  16
  ???????
  _
  ADJ
  ADJ
  ADJ
  15
  amod
  _
  _
  17
  ??
  _
  P
  P
  P
  15
  prep
  _
  _
  18
  ???
  _
  N
  N_SING
  SING
  17
  pobj
  _
  _
  19
  ?
  _
  CON
  CON
  CON
  18
  cc
  _
  _
  20
  ????
  _
  N
  N_SING
  SING
  18
  conj
  _
  _
  21
  ????
  _
  N
  N_SING
  SING
  20
  poss/pc
  _
  _
  22)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:242)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:189)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:140)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTreeHelper(TreeAnnotator.java:145)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTree(TreeAnnotator.java:51)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotatorAndBinarizer.transformTree(TreeAnnotatorAndBinarizer.java:104)
    at edu.stanford.nlp.trees.CompositeTreeTransformer.transformTree(CompositeTreeTransformer.java:30)
    at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:195)
    at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:176)
    at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.primeNext(FilteringTreebank.java:100)
    at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.<init>(FilteringTreebank.java:85)
    at edu.stanford.nlp.trees.FilteringTreebank.iterator(FilteringTreebank.java:72)
    at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.tallyTrees(AbstractTreeExtractor.java:64)
    at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.extract(AbstractTreeExtractor.java:89)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTreebank(LexicalizedParser.java:881)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1394)

2. Training the model from Java code:

import java.io.File;
import java.io.IOException;
import java.util.Collection;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.Options;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.Treebank;
import edu.stanford.nlp.trees.TreebankLanguagePack;


public class FromTreeBank {

    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub

        String treebankPathUPDT = "src/model/UPDT.1.2/train.conll";
        String persianFilePath  = "src/txt/persianSentences.txt";

        File file = new File(treebankPathUPDT);

        Options op = new Options();   
        Treebank tr = op.tlpParams.diskTreebank();
        tr.loadPath(file);    
        LexicalizedParser lpc = LexicalizedParser.trainFromTreebank(tr,op);

        //Once the lpc is trained, use it to parse a file which contains Persian text  
        //demoDP(lpc, persianFilePath);
    }


    public static void demoDP(LexicalizedParser lp, String filename) {
        // This option shows loading, sentence-segmenting and tokenizing
        // a file using DocumentPreprocessor.
        TreebankLanguagePack tlp = lp.treebankLanguagePack(); // a PennTreebankLanguagePack for English
        GrammaticalStructureFactory gsf = null;
        if (tlp.supportsGrammaticalStructures()) {
            gsf = tlp.grammaticalStructureFactory();
        }
        // You could also create a tokenizer here (as below) and pass it
        // to DocumentPreprocessor
        for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
            Tree parse = lp.apply(sentence);
            parse.pennPrint();
            System.out.println();
            if (gsf != null) {
                GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
                Collection tdl = gs.typedDependenciesCCprocessed();
                System.out.println(tdl);
                System.out.println();
            }
        }
    }

}

The Java program above throws the same exception:

Options parameters:
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType false
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams
forceCNF false
doPCFG true
doDep true
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags true
nPrune false
Train parameters:
 smooth=false
 PA=true
 GPA=false
 selSplit=false
 (0.0)
 mUnary=0
 mUnaryTags=false
 sPPT=false
 tagPA=false
 tagSelSplit=false (0.0)
 rightRec=false
 leftRec=false
 collinsPunc=false
 markov=false
 mOrd=1
 hSelSplit=false (10)
 compactGrammar=0
 postPA=false
 postGPA=false
 selPSplit=false (0.0)
 tagSelPSplit=false (0.0)
 postSplitWithBase=false
 fractionBeforeUnseenCounting=0.5
 openClassTypesThreshold=50
 preTransformer=null
 taggedFiles=null
 predictSplits=false
 splitCount=1
 splitRecombineRate=0.0
 simpleBinarizedLabels=false
 noRebinarization=false
 trainingThreads=1
 dvKBest=100
 trainingIterations=40
 batchSize=25
 regCost=1.0E-4
 qnIterationsPerBatch=1
 qnEstimates=15
 qnTolerance=15.0
 debugOutputFrequency=0
 randomSeed=0
 learningRate=0.1
 deltaMargin=0.1
 unknownNumberVector=true
 unknownDashedWordVectors=true
 unknownCapsVector=true
 unknownChineseYearVector=true
 unknownChineseNumberVector=true
 unknownChinesePercentVector=true
 dvSimplifiedModel=false
 scalingForInit=0.5
 maxTrainTimeSeconds=0
 unkWord=*UNK*
 lowercaseWordVectors=false
 transformMatrixType=DIAGONAL
 useContextWords=false
 trainWordVectors=true
 stalledIterationLimit=12
 markStrahler=false

Using EnglishTreebankParserParams splitIN=0 sPercent=false sNNP=0 sQuotes=false sSFP=false rbGPA=false j#=false jJJ=false jNounTags=false sPPJJ=false sTRJJ=false sJJCOMP=false sMoreLess=false unaryDT=false unaryRB=false unaryPRP=false reflPRP=false unaryIN=false sCC=0 sNT=false sRB=false sAux=0 vpSubCat=false mDTV=0 sVP=0 sVPNPAgr=false sSTag=0 mVP=false sNP%=0 sNPPRP=false dominatesV=0 dominatesI=false dominatesC=false mCC=0 sSGapped=0 numNP=false sPoss=0 baseNP=0 sNPNNP=0 sTMP=0 sNPADV=0 cTags=false rightPhrasal=false gpaRootVP=false splitSbar=0 mPPTOiIN=0 cWh=0
Binarizing trees...done. Time elapsed: 122 ms
Extracting PCFG...PennTreeReader: warning: file has extra non-matching right parenthesis [ignored]
java.lang.IllegalArgumentException: No head rule defined for _ using class edu.stanford.nlp.trees.ModCollinsHeadFinder in (_
  DELM
  DELM
  DELM
  13
  punct
  _
  _
  15
  تلفیقی
  _
  N
  N_SING
  SING
  13
  appos
  _
  _
  16
  طنزآمیز
  _
  ADJ
  ADJ
  ADJ
  15
  amod
  _
  _
  17
  از
  _
  P
  P
  P
  15
  prep
  _
  _
  18
  اسم
  _
  N
  N_SING
  SING
  17
  pobj
  _
  _
  19
  و
  _
  CON
  CON
  CON
  18
  cc
  _
  _
  20
  شیوه
  _
  N
  N_SING
  SING
  18
  conj
  _
  _
  21
  کارش
  _
  N
  N_SING
  SING
  20
  poss/pc
  _
  _
  22)


    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:242)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:189)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:140)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTreeHelper(TreeAnnotator.java:145)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTree(TreeAnnotator.java:51)
    at edu.stanford.nlp.parser.lexparser.TreeAnnotatorAndBinarizer.transformTree(TreeAnnotatorAndBinarizer.java:104)
    at edu.stanford.nlp.trees.CompositeTreeTransformer.transformTree(CompositeTreeTransformer.java:30)
    at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:195)
    at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:176)
    at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.primeNext(FilteringTreebank.java:100)
    at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.<init>(FilteringTreebank.java:85)
    at edu.stanford.nlp.trees.FilteringTreebank.iterator(FilteringTreebank.java:72)
    at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.tallyTrees(AbstractTreeExtractor.java:64)
    at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.extract(AbstractTreeExtractor.java:89)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTreebank(LexicalizedParser.java:881)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.trainFromTreebank(LexicalizedParser.java:267)
    at edu.stanford.nlp.parser.lexparser.LexicalizedParser.trainFromTreebank(LexicalizedParser.java:278)
    at FromTreeBank.main(FromTreeBank.java:46)

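Honestly, I am not sure whether either the command line or the Java code is correct, and I cannot figure out what is missing from them. I would be very grateful if someone could tell me why I get these exceptions and what is going wrong, or suggest a better way to train a model from a treebank.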

Thanks.

3 Answers:

Answer 0 (score: 0)

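If you are still wondering why this error occurs, the cause is exactly what the message says: no head rule is defined for the "_" character (the underscore) in the class edu.stanford.nlp.trees.ModCollinsHeadFinder.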

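I hit the same error with parenthesis characters. After removing the data that contained parentheses, I could train the Stanford parser without errors. I have not yet tried to find a direct fix by changing the code; the simplest workaround is to remove the data containing such characters, as I did.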
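The removal step described above could look roughly like the following. This is only a sketch under assumptions that are not in the answer: the data is taken to be organized as sentences separated by blank lines (as in CoNLL files), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: drops every blank-line-separated sentence that contains a parenthesis. */
final class ParenSentenceFilter {

    /** Returns the input lines without the sentences in which any line contains "(" or ")". */
    static List<String> dropSentencesWithParens(List<String> lines) {
        List<String> kept = new ArrayList<>();
        List<String> sentence = new ArrayList<>();
        boolean hasParen = false;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // A blank line closes the current sentence.
                flush(kept, sentence, hasParen);
                sentence.clear();
                hasParen = false;
                continue;
            }
            if (line.indexOf('(') >= 0 || line.indexOf(')') >= 0) {
                hasParen = true;
            }
            sentence.add(line);
        }
        // The last sentence may not be followed by a blank line.
        flush(kept, sentence, hasParen);
        return kept;
    }

    /** Copies a finished sentence to the output, followed by a blank separator line. */
    private static void flush(List<String> kept, List<String> sentence, boolean hasParen) {
        if (!sentence.isEmpty() && !hasParen) {
            kept.addAll(sentence);
            kept.add("");
        }
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "1\tHello\t_\tN\tN\t_\t0\troot\t_\t_",
            "",
            "1\t(\t_\tDELM\tDELM\t_\t2\tpunct\t_\t_",
            "2\tword\t_\tN\tN\t_\t0\troot\t_\t_");
        // Only the first sentence survives; the second contains "(".
        dropSentencesWithParens(sample).forEach(System.out::println);
    }
}
```

Dropping whole sentences rather than single lines keeps the remaining token IDs and head indices consistent, at the cost of losing some training data.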

If you have already solved the problem, could you share how? I would also like to learn more about the Stanford parser.

Answer 1 (score: 0)

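The biggest problem here is that you are trying to train a constituency parser (also known as a phrase-structure parser) with a dependency treebank, which does not work.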
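One way to see the mismatch is to check what the training file actually contains before choosing a parser. The sketch below is a rough heuristic, not part of Stanford NLP: it assumes a dependency treebank uses ten tab-separated columns with an integer ID in column 1 and an integer HEAD in column 7 (the CoNLL-X layout, which matches the rows visible in the exception dump), and that a constituency treebank starts each tree with an opening parenthesis. The class and enum names are hypothetical.

```java
import java.util.List;

/** Hypothetical heuristic: guesses whether a treebank holds dependency rows or bracketed trees. */
final class TreebankFormatSniffer {

    enum Format { CONLL_DEPENDENCY, BRACKETED_CONSTITUENCY, UNKNOWN }

    /** Classifies the treebank by looking at its first non-blank line only. */
    static Format sniff(List<String> lines) {
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            if (line.trim().startsWith("(")) {
                // Penn-style trees look like "(ROOT (S (NP ...) (VP ...)))".
                return Format.BRACKETED_CONSTITUENCY;
            }
            String[] cols = line.split("\t");
            // Assumed CoNLL-X layout: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL.
            if (cols.length == 10 && isInt(cols[0]) && isInt(cols[6])) {
                return Format.CONLL_DEPENDENCY;
            }
            return Format.UNKNOWN;
        }
        return Format.UNKNOWN;
    }

    /** True if the whole string consists of ASCII digits. */
    private static boolean isInt(String s) {
        return s.matches("\\d+");
    }

    public static void main(String[] args) {
        System.out.println(sniff(List.of("15\tWORD\t_\tN\tN_SING\tSING\t13\tappos\t_\t_")));
        System.out.println(sniff(List.of("(ROOT (S (NP (NN example))))")));
    }
}
```

A file that this check classifies as dependency rows has no phrase labels for a head finder to work with, which is consistent with the "No head rule defined for _" error in the question.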

CoreNLP also ships with a neural-network-based dependency parser that you can train on the UPDT data. See the parser's project page for instructions on how to train a model.
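As a rough illustration of what such a training run might look like, the sketch below only assembles the command line for the neural dependency parser's main class, edu.stanford.nlp.parser.nndep.DependencyParser, with its -trainFile, -devFile and -model options. The jar name, the file paths and the helper class are assumptions made for this example, and the project page remains the authoritative source for the full option list (word embeddings, for instance, are not covered here).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles a training command for the CoreNLP neural dependency parser. */
final class NndepTrainCommand {

    /** Builds the argument list; every path passed in is supplied by the caller. */
    static List<String> build(String classpath, String trainFile, String devFile, String modelFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add("edu.stanford.nlp.parser.nndep.DependencyParser");
        cmd.add("-trainFile");
        cmd.add(trainFile);
        cmd.add("-devFile");
        cmd.add(devFile);
        cmd.add("-model");
        cmd.add(modelFile);
        return cmd;
    }

    public static void main(String[] args) {
        // All four values below are placeholders, not paths taken from the question.
        List<String> cmd = build(
            "stanford-corenlp.jar",
            "UPDT/train.conll",
            "UPDT/dev.conll",
            "UPDT/updt.nndep.model.txt.gz");
        // Print the command; it could also be handed to a ProcessBuilder.
        System.out.println(String.join(" ", cmd));
    }
}
```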

Answer 2 (score: 0)

Simply replace every "(" with "-LRB-" and every ")" with "-RRB-" in "trainFile.conll" (or a file in any other format), then rerun the parser. This worked for me.
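A minimal sketch of that replacement is shown below, assuming the file is UTF-8 text in a column format such as CoNLL, where parentheses occur only as tokens. The class name and the file paths are hypothetical. In a bracketed Penn-style treebank the parentheses are structural, so a blanket replacement like this one would destroy the trees.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: rewrites literal parentheses as -LRB- / -RRB-. */
final class BracketEscaper {

    /** Replaces every "(" with "-LRB-" and every ")" with "-RRB-" in one line. */
    static String escape(String line) {
        return line.replace("(", "-LRB-").replace(")", "-RRB-");
    }

    /** Reads a UTF-8 file, escapes each line, and writes the result to a new file. */
    static void escapeFile(Path in, Path out) throws IOException {
        List<String> escaped = new ArrayList<>();
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            escaped.add(escape(line));
        }
        Files.write(out, escaped, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            // Usage: java BracketEscaper trainFile.conll trainFile.escaped.conll
            escapeFile(Paths.get(args[0]), Paths.get(args[1]));
        } else {
            System.out.println(escape("1\t(\t_\tDELM\tDELM\t_\t2\tpunct\t_\t_"));
        }
    }
}
```

Writing to a separate output file keeps the original treebank untouched in case the replacement needs to be revisited.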