Question

I am new to Stanford NLP and NER, and I am trying to train a custom classifier with data sets of currencies and countries.

My training data in training-data-currency.tsv looks like:

USD CURRENCY
GBP CURRENCY

And the training data in training-data-countries.tsv looks like:

USA COUNTRY
UK  COUNTRY

And the classifier properties look like:

trainFileList = classifiers/training-data-currency.tsv,classifiers/training-data-countries.tsv
ner.model=classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz
serializeTo = classifiers/my-classification-model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

The Java code to find the categories is:

LinkedHashMap<String, LinkedHashSet<String>> map = new LinkedHashMap<>();
NERClassifierCombiner classifier = null;
try {
    classifier = new NERClassifierCombiner(true, true, 
            "C:\\Users\\perso\\Downloads\\stanford-ner-2015-04-20\\stanford-ner-2015-04-20\\classifiers\\my-classification-model.ser.gz"
            );
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
List<List<CoreLabel>> classify = classifier.classify("Zambia");
for (List<CoreLabel> coreLabels : classify) {
    for (CoreLabel coreLabel : coreLabels) {

        String word = coreLabel.word();
        String category = coreLabel
                .get(CoreAnnotations.AnswerAnnotation.class);
        if (!"O".equals(category)) {
            if (map.containsKey(category)) {
                map.get(category).add(word);
            } else {
                LinkedHashSet<String> temp = new LinkedHashSet<String>();
                temp.add(word);
                map.put(category, temp);
            }
            System.out.println(word + ":" + category);
        }

    }

}

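When I run the above code with the input "USD" or "UK", I get the expected result, "CURRENCY" or "COUNTRY". But when I input something like "Russia", the returned value is "CURRENCY", which comes from the first train file in the properties. I expect 'O' to be returned for values that are not present in my training data.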

How can I achieve this behavior? Any pointers on where I am going wrong would be really helpful.

Answer 1

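Hi, I'll try to help out!

It sounds to me like you have a list of strings that should be labeled "CURRENCY", a list of strings that should be labeled "COUNTRY", and so on.

And you want to tag strings based on those lists. So when you see "Russia" you want it tagged "COUNTRY", and when you see "USD" you want it tagged "CURRENCY".

I think these tools will be more helpful to you (particularly the first one):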

http://nlp.stanford.edu/software/regexner/

http://nlp.stanford.edu/software/tokensregex.shtml

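NERClassifierCombiner is designed to be trained on large volumes of tagged sentences, and it looks at a variety of features, including capitalization and the surrounding words, to make a guess about a given word's NER label.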

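But in your case, it sounds like you just want to explicitly tag certain sequences based on a predefined list. So I would explore the links provided above.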
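
To make the idea of list-based tagging concrete, here is a minimal sketch in plain Java. It does not use RegexNER or any Stanford API; the class name ListTagger and the method tag are hypothetical, and the entries are just the four examples from the question. It only illustrates the behavior being asked for: a token found in a predefined list gets that list's label, and anything else falls back to "O".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of list-based tagging; not part of Stanford NER.
public class ListTagger {
    // Lookup table built from the two lists shown in the question.
    private static final Map<String, String> LABELS = new LinkedHashMap<>();
    static {
        LABELS.put("USD", "CURRENCY");
        LABELS.put("GBP", "CURRENCY");
        LABELS.put("USA", "COUNTRY");
        LABELS.put("UK", "COUNTRY");
    }

    // Returns the label for a token that is in a list, or "O" otherwise.
    public static String tag(String token) {
        return LABELS.getOrDefault(token, "O");
    }

    public static void main(String[] args) {
        for (String token : new String[] {"USD", "UK", "Russia"}) {
            System.out.println(token + ":" + tag(token));
        }
    }
}
```

With this kind of exact lookup, "Russia" comes back as "O" until it is added to the country list, whereas a statistical classifier trained only on labeled entries has to guess one of the labels it has seen.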

Please let me know if you need any more help, and I will be happy to follow up!

How to suppress unmatched words in the Stanford NER classifier?
