I am new to Mallet and have started with the latest stable version, 2.0.8. My task is to write a sequence tagger.
Here is the code:
ArrayList<Pipe> pipes = new ArrayList<>();
pipes.add(new SaveDataInSource());
pipes.add(new CharSequence2TokenSequence());
pipes.add(new TokenTextCharPrefix("prefix1=", 1));
pipes.add(new TokenTextCharPrefix("prefix2=", 2));
pipes.add(new TokenTextCharSuffix("suffix1=", 1));
pipes.add(new TokenTextCharSuffix("suffix2=", 2));
pipes.add(new TokenText("word="));
pipes.add(new RegexMatches("CAPITALIZED", Pattern.compile("^\\p{Lu}.*")));
pipes.add(new RegexMatches("STARTSNUMBER", Pattern.compile("^[0-9].*")));
pipes.add(new RegexMatches("HYPHENATED", Pattern.compile(".*\\-.*")));
pipes.add(new TokenTextCharNGrams("bigram=", new int[] {2}));
pipes.add(new TokenTextCharNGrams("trigram=", new int[] {3}));
pipes.add(new MyTargetTagger());
pipes.add(new PrintTokenSequenceFeatures());
pipes.add(new TokenSequence2FeatureVectorSequence());
String[] str = new String[] {
"this is the first sentence John how are you",
"this is the second sentence Maria how are you",
"this is the third sentence Will how are you"
};
Pipe pipe = new SerialPipes(pipes);
InstanceList trainingInstances = new InstanceList(pipe);
trainingInstances.addThruPipe(new ArrayIterator(str));
CRF crf = new CRF(pipe, null);
crf.addStatesForThreeQuarterLabelsConnectedAsIn(trainingInstances);
crf.addStartState();
Instance r = crf.transduce(new Instance("this is a sentence Bruno how are you ?",null,null,null));
System.out.println(r.getData().toString());
As you can see, I use a new pipe (MyTargetTagger) with this code:
public Instance pipe (Instance carrier)
{
TokenSequence ts = (TokenSequence) carrier.getData();
LabelSequence labelSeq = new LabelSequence(getTargetAlphabet());
for (int i = 0; i < ts.size(); i++) {
if (ts.get(i).getText().equals("John")) {
labelSeq.add("PERSON");
} else if (ts.get(i).getText().equals("Maria")) {
labelSeq.add("PERSON");
} else if (ts.get(i).getText().equals("Will")) {
labelSeq.add("PERSON");
} else {
labelSeq.add("O");
}
}
System.out.print(labelSeq.toString());
carrier.setTarget(labelSeq);
return carrier;
}
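This is silly, I know, but it is only a test to understand how the target labels are interpreted. The labels for all three sentences are (obviously) the same: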
0: O (0)
1: O (0)
2: O (0)
3: O (0)
4: O (0)
5: PERSON (1)
6: O (0)
7: O (0)
8: O (0)
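To make the listing above concrete, here is a minimal standalone sketch in plain Java (no Mallet) of the same labeling rule and of the "position: LABEL (index)" format. The class ToyLabeler, its method names, the whitespace tokenization, and the first-seen index assignment are assumptions made for illustration; they are not Mallet's API and may not match how Mallet tokenizes or builds its label alphabet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ToyLabeler {
    // Hypothetical stand-in for the hard-coded name checks in MyTargetTagger.
    private static final Set<String> PERSON_NAMES =
            new HashSet<>(Arrays.asList("John", "Maria", "Will"));

    /** Returns one label per whitespace-separated token: PERSON for a known name, O otherwise. */
    public static List<String> label(String sentence) {
        List<String> labels = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            labels.add(PERSON_NAMES.contains(token) ? "PERSON" : "O");
        }
        return labels;
    }

    /** Formats labels as "position: LABEL (index)", assigning indices in first-seen order. */
    public static List<String> describe(List<String> labels) {
        Map<String, Integer> alphabet = new LinkedHashMap<>();
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < labels.size(); i++) {
            String label = labels.get(i);
            // A label gets the next free index the first time it is seen.
            alphabet.putIfAbsent(label, alphabet.size());
            lines.add(i + ": " + label + " (" + alphabet.get(label) + ")");
        }
        return lines;
    }

    public static void main(String[] args) {
        describe(label("this is the first sentence John how are you"))
                .forEach(System.out::println);
    }
}
```

Under these assumptions, O is seen first and gets index 0, and PERSON gets index 1, which is consistent with the listing above.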
As you can see, I also added pipes.add(new PrintTokenSequenceFeatures());
This is its output:
First sentence:
name: array:0
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t
O bigram=is word=is suffix1=s prefix1=i
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t
O trigram=rst trigram=irs trigram=fir bigram=st bigram=rs bigram=ir bigram=fi word=first suffix2=st suffix1=t prefix2=fi prefix1=f
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s
PERSON trigram=ohn trigram=Joh bigram=hn bigram=oh bigram=Jo CAPITALIZED word=John suffix2=hn suffix1=n prefix2=Jo prefix1=J
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y
Second sentence:
name: array:1
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t
O bigram=is word=is suffix1=s prefix1=i
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t
O trigram=ond trigram=con trigram=eco trigram=sec bigram=nd bigram=on bigram=co bigram=ec bigram=se word=second suffix2=nd suffix1=d prefix2=se prefix1=s
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s
PERSON trigram=ria trigram=ari trigram=Mar bigram=ia bigram=ri bigram=ar bigram=Ma CAPITALIZED word=Maria suffix2=ia suffix1=a prefix2=Ma prefix1=M
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y
Third sentence:
name: array:2
O trigram=his trigram=thi bigram=is bigram=hi bigram=th word=this suffix2=is suffix1=s prefix2=th prefix1=t
O bigram=is word=is suffix1=s prefix1=i
O trigram=the bigram=he bigram=th word=the suffix2=he suffix1=e prefix2=th prefix1=t
O trigram=ird trigram=hir trigram=thi bigram=rd bigram=ir bigram=hi bigram=th word=third suffix2=rd suffix1=d prefix2=th prefix1=t
O trigram=nce trigram=enc trigram=ten trigram=nte trigram=ent trigram=sen bigram=ce bigram=nc bigram=en bigram=te bigram=nt bigram=en bigram=se word=sentence suffix2=ce suffix1=e prefix2=se prefix1=s
PERSON trigram=ill trigram=Wil bigram=ll bigram=il bigram=Wi CAPITALIZED word=Will suffix2=ll suffix1=l prefix2=Wi prefix1=W
O trigram=how bigram=ow bigram=ho word=how suffix2=ow suffix1=w prefix2=ho prefix1=h
O trigram=are bigram=re bigram=ar word=are suffix2=re suffix1=e prefix2=ar prefix1=a
O trigram=you bigram=ou bigram=yo word=you suffix2=ou suffix1=u prefix2=yo prefix1=y
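For reference, the per-token features printed above can be reproduced with a short standalone sketch in plain Java (no Mallet). The class ToyTokenFeatures and its methods are hypothetical. The rule that a prefix or suffix feature is only added when the token is strictly longer than the affix is an assumption inferred from the printed output (the token "is" has no prefix2 or suffix2 feature), and the order of the returned list is my own choice, not the order Mallet prints.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyTokenFeatures {
    /** Builds the feature names for one token, mirroring the pipes configured above. */
    public static List<String> features(String token) {
        List<String> f = new ArrayList<>();
        // Prefix and suffix features of length 1 and 2.
        // Assumption: only emitted when the token is strictly longer than the affix.
        for (int n = 1; n <= 2; n++) {
            if (token.length() > n) {
                f.add("prefix" + n + "=" + token.substring(0, n));
                f.add("suffix" + n + "=" + token.substring(token.length() - n));
            }
        }
        f.add("word=" + token);
        // Same regular expressions as the RegexMatches pipes; matches() tests the whole token.
        if (token.matches("^\\p{Lu}.*")) f.add("CAPITALIZED");
        if (token.matches("^[0-9].*")) f.add("STARTSNUMBER");
        if (token.matches(".*\\-.*")) f.add("HYPHENATED");
        addNGrams(f, "bigram=", token, 2);
        addNGrams(f, "trigram=", token, 3);
        return f;
    }

    /** Adds every contiguous character n-gram of the token, left to right. */
    private static void addNGrams(List<String> f, String prefix, String token, int n) {
        for (int i = 0; i + n <= token.length(); i++) {
            f.add(prefix + token.substring(i, i + n));
        }
    }

    public static void main(String[] args) {
        System.out.println(features("John"));
        System.out.println(features("is"));
    }
}
```

Comparing the result for "John" with the PERSON line of the first sentence shows the same set of features, with CAPITALIZED being the only one that separates the names from the other tokens beyond their character content.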
When I do this:
Instance r = crf.transduce(new Instance("this is a sentence Bruno how are you",null,null,null));
System.out.println(r.getData().toString());
to see how the model performs on a new instance, the output is:
PERSON O PERSON O PERSON O
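Why do I get this output?
I know I need much more data to train the model properly, of course, but I would like to know whether something is wrong with my code.
Thank you very much!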