How to train a sequence CRF model with Mallet

Time: 2016-12-31 13:54:13

Tags: java nlp mallet

I am new to Mallet and have started with the latest stable release, 2.0.8. My task is to write a sequence tagger.

Here is the code:

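import java.util.ArrayList;
import java.util.regex.Pattern;

import cc.mallet.fst.CRF;
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.ArrayIterator;
import cc.mallet.pipe.tsf.*;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

// Feature-extraction pipeline: raw text -> tokens -> per-token features -> feature vectors
ArrayList<Pipe> pipes = new ArrayList<>();

pipes.add(new SaveDataInSource());
pipes.add(new CharSequence2TokenSequence());
// Character prefixes and suffixes of length 1 and 2
pipes.add(new TokenTextCharPrefix("prefix1=", 1));
pipes.add(new TokenTextCharPrefix("prefix2=", 2));
pipes.add(new TokenTextCharSuffix("suffix1=", 1));
pipes.add(new TokenTextCharSuffix("suffix2=", 2));
// The token text itself
pipes.add(new TokenText("word="));
// Orthographic features
pipes.add(new RegexMatches("CAPITALIZED", Pattern.compile("^\\p{Lu}.*")));
pipes.add(new RegexMatches("STARTSNUMBER", Pattern.compile("^[0-9].*")));
pipes.add(new RegexMatches("HYPHENATED", Pattern.compile(".*\\-.*")));
// Character bigrams and trigrams
pipes.add(new TokenTextCharNGrams("bigram=", new int[] {2}));
pipes.add(new TokenTextCharNGrams("trigram=", new int[] {3}));
// Custom pipe that assigns the target labels (shown further down)
pipes.add(new MyTargetTagger());
pipes.add(new PrintTokenSequenceFeatures());
pipes.add(new TokenSequence2FeatureVectorSequence());

String[] str = new String[] {
    "this is the first sentence John how are you",
    "this is the second sentence Maria how are you",
    "this is the third sentence Will how are you"
};

Pipe pipe = new SerialPipes(pipes);

// Run the three sentences through the pipeline
InstanceList trainingInstances = new InstanceList(pipe);
trainingInstances.addThruPipe(new ArrayIterator(str));

// Create the CRF and add states and transitions based on the labels in the training instances
CRF crf = new CRF(pipe, null);
crf.addStatesForThreeQuarterLabelsConnectedAsIn(trainingInstances);
crf.addStartState();

// Label a new sentence with the CRF as configured above
Instance r = crf.transduce(new Instance("this is a sentence Bruno how are you ?", null, null, null));
System.out.println(r.getData().toString());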

As you can see, I use a new pipe (MyTargetTagger) with this code:

public Instance pipe (Instance carrier)
{
    // The data at this point is the token sequence produced by CharSequence2TokenSequence
    TokenSequence ts = (TokenSequence) carrier.getData();
    LabelSequence labelSeq = new LabelSequence(getTargetAlphabet());

    // Hard-coded toy rule: the three known names are PERSON, everything else is O
    for (int i = 0; i < ts.size(); i++) {
        if (ts.get(i).getText().equals("John")) {
            labelSeq.add("PERSON");
        } else if (ts.get(i).getText().equals("Maria")) {
            labelSeq.add("PERSON");
        } else if (ts.get(i).getText().equals("Will")) {
            labelSeq.add("PERSON");
        } else {
            labelSeq.add("O");
        }
    }

    System.out.print(labelSeq.toString());

    carrier.setTarget(labelSeq);
    // The method is declared to return an Instance, so hand the carrier back
    return carrier;
}

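I know this is silly, but it is only a test to understand how the target labels are interpreted. The labels are (obviously) the same for all three sentences: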

0: O (0)
1: O (0)
2: O (0)
3: O (0)
4: O (0)
5: PERSON (1)
6: O (0)
7: O (0)
8: O (0)
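
For reference, the labeling rule in MyTargetTagger and the label sequence listed above can be reproduced outside Mallet. This is only a sketch: ToyLabeler is a hypothetical helper, not a Mallet class, and it assumes plain whitespace tokenization, which may differ from what CharSequence2TokenSequence does on other input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone helper (not part of Mallet) mirroring the toy rule in MyTargetTagger.
class ToyLabeler {
    // The three names that the test tagger treats as PERSON.
    private static final Set<String> PERSON_NAMES =
            new HashSet<>(Arrays.asList("John", "Maria", "Will"));

    // Assumption: tokens are separated by whitespace.
    public static List<String> label(String sentence) {
        List<String> labels = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            labels.add(PERSON_NAMES.contains(token) ? "PERSON" : "O");
        }
        return labels;
    }

    public static void main(String[] args) {
        // prints [O, O, O, O, O, PERSON, O, O, O]
        System.out.println(label("this is the first sentence John how are you"));
    }
}
```

With this rule, a sentence containing an unseen name such as "Bruno" gets only O labels, since the rule is a fixed lookup and not something learned.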

As you can see, I also added pipes.add(new PrintTokenSequenceFeatures()); and this is its output:

First sentence:

name: array:0
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t 
O bigram=is word=is suffix1=s prefix1=i 
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t 
O trigram=rst trigram=irs trigram=fir bigram=st bigram=rs bigram=ir bigram=fi word=first suffix2=st suffix1=t prefix2=fi prefix1=f 
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s 
PERSON trigram=ohn trigram=Joh bigram=hn bigram=oh bigram=Jo CAPITALIZED word=John suffix2=hn suffix1=n prefix2=Jo prefix1=J 
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h 
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a 
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y 

Second sentence:

name: array:1
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t 
O bigram=is word=is suffix1=s prefix1=i 
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t 
O trigram=ond trigram=con trigram=eco trigram=sec bigram=nd bigram=on bigram=co bigram=ec bigram=se word=second suffix2=nd suffix1=d prefix2=se prefix1=s 
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s 
PERSON trigram=ria trigram=ari trigram=Mar bigram=ia bigram=ri bigram=ar bigram=Ma CAPITALIZED word=Maria suffix2=ia suffix1=a prefix2=Ma prefix1=M 
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h 
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a 
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y

Third sentence:

name: array:2
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t 
O bigram=is word=is suffix1=s prefix1=i 
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t 
O trigram=ird trigram=hir trigram=thi bigram=rd bigram=ir bigram=hi bigram=th word=third suffix2=rd suffix1=d prefix2=th prefix1=t 
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s 
PERSON trigram=ill trigram=Wil bigram=ll bigram=il bigram=Wi CAPITALIZED word=Will suffix2=ll suffix1=l prefix2=Wi prefix1=W 
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h 
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a 
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y
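
The feature names printed above can be reproduced for a single token without Mallet. The following is a rough sketch, not Mallet's implementation: ToyTokenFeatures is a hypothetical class, and its length rules (a prefix or suffix of length n only when the token is longer than n, n-grams when the token has at least n characters) are inferred from the printed output, for example the line for "is". It collects features in a set, so a repeated n-gram such as bigram=en in "sentence" appears only once, and the order differs from the printout.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-alone helper (not part of Mallet) that mimics the printed feature names.
class ToyTokenFeatures {
    private static final Pattern CAPITALIZED = Pattern.compile("^\\p{Lu}.*");
    private static final Pattern STARTS_NUMBER = Pattern.compile("^[0-9].*");
    private static final Pattern HYPHENATED = Pattern.compile(".*-.*");

    public static Set<String> features(String token) {
        Set<String> out = new LinkedHashSet<>();
        int len = token.length();
        // Assumption inferred from the output: prefix/suffix of length n only if the token is longer than n.
        for (int n = 1; n <= 2; n++) {
            if (len > n) {
                out.add("prefix" + n + "=" + token.substring(0, n));
                out.add("suffix" + n + "=" + token.substring(len - n));
            }
        }
        out.add("word=" + token);
        // Whole-token regex matches, as with the three RegexMatches pipes.
        if (CAPITALIZED.matcher(token).matches()) out.add("CAPITALIZED");
        if (STARTS_NUMBER.matcher(token).matches()) out.add("STARTSNUMBER");
        if (HYPHENATED.matcher(token).matches()) out.add("HYPHENATED");
        addNGrams(out, "bigram=", token, 2);
        addNGrams(out, "trigram=", token, 3);
        return out;
    }

    // Adds every contiguous character n-gram of the token under the given feature prefix.
    private static void addNGrams(Set<String> out, String prefix, String token, int n) {
        for (int i = 0; i + n <= token.length(); i++) {
            out.add(prefix + token.substring(i, i + n));
        }
    }

    public static void main(String[] args) {
        System.out.println(features("is"));
        System.out.println(features("John"));
    }
}
```

For "John" this yields the same feature names as the PERSON line of the first sentence, including CAPITALIZED, which is the only feature here that could generalize to an unseen capitalized name.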

When I run this:

Instance r = crf.transduce(new Instance("this is a sentence Bruno how are you",null,null,null));                
System.out.println(r.getData().toString()); 

to see how the model performs on a new instance, the output is:

P人员O PERSON O PERSON O

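Why is this the output?

Of course, I know I need a lot more data to train my model properly, but I would like to know whether there is something wrong with my code.

Thank you very much!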

0 answers:

No answers