Training a model using named entities

Date: 2015-04-20 18:43:01

Tags: nlp stanford-nlp sentiment-analysis named-entity-recognition pos-tagger

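I am trying to use the Named Entity Recognizer in Stanford CoreNLP. I have different types of input text and I need to tag it with my own entities, so I started training my own model, but it does not seem to work.

For example, my input text string is "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q".

I followed the examples to train my own model, looking only for the few words I am interested in.

My jane-austen-emma-ch1.tsv looks like this: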

Toyota  PERS
Land Cruiser    PERS

From the input text above I am only interested in two terms: one is Toyota and the other is Land Cruiser.

austen.prop looks like this:

trainFile = jane-austen-emma-ch1.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

I run the following command to generate the ner-model.ser.gz file:

java -cp stanford-corenlp-3.4.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop

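import java.util.List;

import edu.stanford.nlp.ie.NERClassifierCombiner;
import edu.stanford.nlp.ling.CoreAnnotations.AnswerAnnotation;
import edu.stanford.nlp.ling.CoreLabel;

// The class name is a placeholder; the original post showed only the main method.
public class NerDemo {

    public static void main(String[] args) {
        // Stock 7-class English model, loaded from the classpath.
        String serializedClassifier = "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz";
        // Custom model produced by the training command above.
        String serializedClassifier2 = "C:/standford-ner/ner-model.ser.gz";
        try {
            // false, false: no numeric classifiers and no SUTime; the custom model is listed first.
            NERClassifierCombiner classifier = new NERClassifierCombiner(false, false,
                    serializedClassifier2, serializedClassifier);
            String ss = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q";
            System.out.println("---");
            List<List<CoreLabel>> out = classifier.classify(ss);
            for (List<CoreLabel> sentence : out) {
                for (CoreLabel word : sentence) {
                    // Print each token followed by the label assigned to it.
                    System.out.print(word.word() + '/' + word.get(AnswerAnnotation.class) + ' ');
                }
                System.out.println();
            }
        } catch (Exception e) {
            // Covers model-loading failures as well as the ClassCastException caught separately in the original.
            e.printStackTrace();
        }
    }
}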

This is the output I get:

Book/PERS of/PERS 49/O Magazine/PERS Articles/PERS on/O Toyota/PERS Land/PERS Cruiser/PERS 1956-1987/PERS Gold/O Portfolio/PERS http://t.co/EqxmY1VmLg/PERS http://t.co/F0Vefuoj9Q/PERS

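I think this is wrong. I am looking for Toyota/PERS and Land Cruiser/PERS (which is a multi-valued field, i.e. it spans more than one word).

Any help is greatly appreciated.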

2 answers:

Answer 0: (score: 2)

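I believe you should also add examples of O entities to your trainFile (O, the letter, is the background label that appears in your output, as in 49/O). As you have it, the trainFile is too simple for anything to be learned: it needs both O and PERS examples so that the model does not annotate everything as PERS. You are not teaching it about the entities you are not interested in. Say, something like this:

Toyota  PERS
of    O
Portfolio    O
49    O

and so on.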
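To avoid labelling every background word by hand, the rows can be generated from the sentence. The following is only a minimal sketch under several assumptions: `TrainingRows` and `label` are hypothetical names, tokens are split on whitespace (CoreNLP's own tokenizer may split differently), and each row is written as one token, a tab, and the label, which matches `map = word=0,answer=1` from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: turns a sentence into token-per-line training rows. */
final class TrainingRows {

    /**
     * Splits the sentence on whitespace and labels each token with entityLabel
     * if it is contained in entityTokens, otherwise with the background label "O".
     */
    static List<String> label(String sentence, Set<String> entityTokens, String entityLabel) {
        List<String> rows = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return rows; // nothing to label
        }
        for (String token : trimmed.split("\\s+")) {
            String tag = entityTokens.contains(token) ? entityLabel : "O";
            rows.add(token + "\t" + tag);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Multi-word entities are listed token by token, because the rows are per token.
        Set<String> entities = new HashSet<>(Arrays.asList("Toyota", "Land", "Cruiser"));
        String sentence = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio";
        for (String row : label(sentence, entities, "PERS")) {
            System.out.println(row);
        }
    }
}
```

Every token that is not in the entity set ends up as an explicit O example, which is the point made above.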

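Also, for phrase-level recognition you should take a look at regexner, where you can define patterns (patterns work well for us). I am doing this through the API and have the following code: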

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
props.put("regexner.mapping", customLocationFilename);

with the following file as customLocationFilename:

Make Believe Town   figure of speech    ORGANIZATION
( /Hello/ [{ ner:PERSON }]+ )   salut   PERSON
Bachelor of (Arts|Laws|Science|Engineering) DEGREE
( /University/ /of/ [{ ner:LOCATION }] )    SCHOOL

and the text: Hello Mary Keller was born on 4th of July and took a Bachelor of Science. Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to University of London from the Make Believe Town depot. INV2345 is for the balance.. Customer contact (Sigourney Weaver) says they will pay this on the usual credit terms (30 days).

The output I get:

Hello Mary Keller is a salut
4th of July is a DATE
Bachelor of Science is a DEGREE
$ 100,000 is a MONEY
40 % is a PERCENT
15th August is a DATE
University of London is a ORGANIZATION
Make Believe Town is a figure of speech
Sigourney Weaver is a PERSON
30 days is a DURATION

For details on how to do this, you can look at the example that got me going.
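Phrase-level lines such as "Bachelor of Science is a DEGREE" can be built from per-token labels by merging adjacent tokens that carry the same tag. The sketch below illustrates that idea in plain Java without CoreNLP; `PhraseMerger` and `merge` are hypothetical names, the token and tag arrays are hand-written stand-ins for annotator output, and it is not necessarily how the output above was produced.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: merges runs of equally tagged tokens into phrases. */
final class PhraseMerger {

    /** Returns one "phrase is a TAG" line per run of adjacent tokens whose tag is not "O". */
    static List<String> merge(String[] words, String[] tags) {
        if (words.length != tags.length) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> out = new ArrayList<>();
        StringBuilder phrase = new StringBuilder();
        String current = "O";
        for (int i = 0; i < words.length; i++) {
            if (!tags[i].equals(current)) {
                flush(out, phrase, current); // the tag changed, so the previous run ends here
                current = tags[i];
            }
            if (!"O".equals(current)) {
                if (phrase.length() > 0) {
                    phrase.append(' ');
                }
                phrase.append(words[i]);
            }
        }
        flush(out, phrase, current); // emit a run that reaches the end of the input
        return out;
    }

    private static void flush(List<String> out, StringBuilder phrase, String tag) {
        if (phrase.length() > 0) {
            out.add(phrase + " is a " + tag);
            phrase.setLength(0);
        }
    }

    public static void main(String[] args) {
        String[] words = {"took", "a", "Bachelor", "of", "Science", "."};
        String[] tags = {"O", "O", "DEGREE", "DEGREE", "DEGREE", "O"};
        for (String line : merge(words, tags)) {
            System.out.println(line);
        }
    }
}
```

One limitation of this simple approach: two different entities with the same tag that sit directly next to each other are merged into a single phrase.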

Answer 1: (score: 1)

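NERClassifier* works at the word level, that is, it labels words, not phrases. Given that, the classifier seems to be performing fine. If you want, you can join the words that form a phrase into a single token. So, in both your labeled examples and your test examples, you would turn "Land Cruiser" into "Land_Cruiser".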
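A minimal sketch of that preprocessing step, assuming it is applied to the raw text before both training and classification. `PhraseJoiner` and `join` are hypothetical names, and the matching is a plain literal match without word-boundary checks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: rewrites a multi-word phrase as a single underscore-joined token. */
final class PhraseJoiner {

    /** Replaces every occurrence of phrase in text, e.g. "Land Cruiser" becomes "Land_Cruiser". */
    static String join(String text, String phrase) {
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            return text; // nothing to join
        }
        String[] words = trimmed.split("\\s+");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                regex.append("\\s+"); // tolerate any whitespace between the words
            }
            regex.append(Pattern.quote(words[i])); // match the word literally
        }
        String replacement = Matcher.quoteReplacement(String.join("_", words));
        return Pattern.compile(regex.toString()).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String text = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio";
        System.out.println(join(text, "Land Cruiser"));
    }
}
```

The same function would have to be applied to the training text and to every input at prediction time, otherwise the model never sees the joined token.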