Question

I am using the Stanford Log-linear Part-Of-Speech Tagger, and this is the sample sentence I tag:

He can't do that

After tagging, I get this result:

He_PRP ca_MD n't_RB do_VB that_DT

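As you can see, can't is split into two tokens: ca is tagged as a modal (MD) and n't is tagged as an adverb (RB).

If I write can not as two separate words, I get the same tags: can is MD and not is RB. So is this way of splitting expected, rather than, say, splitting into can_MD and a separate 't token?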

Answer 1

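Note: this is not a perfect answer.

I think the problem originates from the tokenizer used in the Stanford POS Tagger, not from the tagger itself. The tokenizer (PTBTokenizer) does not handle apostrophes properly:
1- Stanford PTBTokenizer token's split delimiter.
2- Stanford coreNLP - split words ignoring apostrophe.

As mentioned on the Stanford Tokenizer page, PTBTokenizer will tokenize the sentence: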

"Oh, no," she's saying, "our $400 blender can't handle something this hard!"

to:

    ......
    our
    $
    400
    blender
    ca
    n't
    handle
    something
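The split shown above follows the Penn Treebank convention of separating the negative clitic n't from its host word. A minimal sketch of that one rule in plain Java may make it easier to see why the result is ca + n't rather than can + 't. The class and method names here (`PtbContractionSketch`, `splitWord`, `tokenize`) are made up for illustration; this is not PTBTokenizer's implementation, and it ignores punctuation, currency symbols, and every other rule the real tokenizer applies.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: NOT the PTBTokenizer implementation.
// It mimics a single Penn Treebank convention, splitting "n't" off its host word.
class PtbContractionSketch {

    // "can't" -> ["ca", "n't"], "don't" -> ["do", "n't"]; other words are returned unchanged.
    static List<String> splitWord(String word) {
        List<String> tokens = new ArrayList<>();
        if (word.length() > 3 && word.toLowerCase().endsWith("n't")) {
            int cut = word.length() - 3;
            tokens.add(word.substring(0, cut)); // host word without the final "n"
            tokens.add(word.substring(cut));    // the clitic "n't"
        } else {
            tokens.add(word);
        }
        return tokens;
    }

    // Splits on whitespace only, then applies the contraction rule to each word.
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.addAll(splitWord(word));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [He, ca, n't, do, that]
        System.out.println(tokenize("He can't do that"));
    }
}
```

Because the clitic is taken to be the whole string n't, the host word loses its final n, which is where the odd-looking token ca comes from.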

Try to find a suitable tokenization method and apply it to the tagger, as follows:

    import java.util.List;

    import edu.stanford.nlp.ling.HasWord;
    import edu.stanford.nlp.ling.Sentence;
    import edu.stanford.nlp.ling.TaggedWord;
    import edu.stanford.nlp.tagger.maxent.MaxentTagger;

    public class Test {

        public static void main(String[] args) throws Exception {
            // Path to the tagger model; adjust it to your local installation.
            String model = "F:/code/stanford-postagger-2015-04-20/models/english-left3words-distsim.tagger";
            MaxentTagger tagger = new MaxentTagger(model);

            // Pass pre-tokenized words so the tagger does not run its own tokenizer.
            List<HasWord> sent;
            sent = Sentence.toWordList("He", "can", "'t", "do", "that", ".");
            // Alternative: keep the contraction as a single token.
            //sent = Sentence.toWordList("He", "can't", "do", "that", ".");

            List<TaggedWord> taggedSent = tagger.tagSentence(sent);
            for (TaggedWord tw : taggedSent) {
                // Print each token as word=TAG.
                System.out.print(tw.word() + "=" + tw.tag() + " , ");
            }
        }
    }

Output:

He=PRP , can=MD , 't=VB , do=VB , that=DT , .=. ,
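Another option, if downstream code needs the contraction as a single word, is to leave the tagger's tokenization alone and rejoin the pieces after tagging. The sketch below assumes the `word_TAG` output format shown in the question and uses a hypothetical convention of joining the two tags with `+`; the class name `ContractionMergeSketch` and the method `merge` are illustrative and not part of the Stanford API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative post-processing sketch; not part of the Stanford tagger API.
class ContractionMergeSketch {

    // Rejoins a "n't" token with the token before it.
    // Input and output items use the "word_TAG" format; merged tags are
    // joined with '+' (an assumed convention, chosen only for this example).
    static List<String> merge(List<String> tagged) {
        List<String> out = new ArrayList<>();
        for (String item : tagged) {
            int sep = item.lastIndexOf('_');
            int last = out.size() - 1;
            int prevSep = last >= 0 ? out.get(last).lastIndexOf('_') : -1;
            boolean isNegation = sep > 0 && item.substring(0, sep).equalsIgnoreCase("n't");
            if (isNegation && prevSep > 0) {
                String prev = out.get(last);
                String word = prev.substring(0, prevSep) + item.substring(0, sep);
                String tags = prev.substring(prevSep + 1) + "+" + item.substring(sep + 1);
                out.set(last, word + "_" + tags); // e.g. ca_MD + n't_RB -> can't_MD+RB
            } else {
                out.add(item);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tagged = Arrays.asList("He_PRP", "ca_MD", "n't_RB", "do_VB", "that_DT");
        // prints [He_PRP, can't_MD+RB, do_VB, that_DT]
        System.out.println(merge(tagged));
    }
}
```

Concatenating ca and n't restores the original spelling can't, so no lookup table is needed for this particular clitic.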

Why does the POS tagging algorithm tag "can't" as separate words?
