Stanford CoreNLP 3.9.1 Chinese models not loading

Posted: 2018-03-13 10:15:58

Tags: stanford-nlp

I have been using Stanford CoreNLP for Chinese text processing.

I upgraded to the latest version, 3.9.1, and found that the Chinese segmenter (along with ssplit and pos) no longer works.

Here is my StanfordCoreNLP.properties file (located under the resources folder):

# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = tokenize, ssplit, pos

# segment
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true

# sentence split
ssplit.boundaryTokenRegex = [.\u3002]|[!?\uFF01\uFF1F]+

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger

# ner
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = true
ner.useSUTime = false

# regexner
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE

# parse
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz

# depparse
depparse.model    = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
depparse.language = chinese

# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.algorithm = hybrid
coref.path.word2vec =
coref.language = zh
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
coref.print.md.log = false
coref.md.type = RULE
coref.md.liberalChineseMD = false

# kbp
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
kbp.language = zh
kbp.model = none

# entitylink
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz

Here is the code that uses Stanford CoreNLP:

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class CoreNlp {

    // Pipeline built with the no-argument constructor
    private static StanfordCoreNLP pipeline = new StanfordCoreNLP();

    // Penn Chinese Treebank POS tags for tokens to discard (function words, particles, punctuation, etc.)
    private static HashSet<String> meaningless = new HashSet<>(Arrays.asList("AD","AS","BA","CC","CS","DEC","DEG","DER","DEV","DT","ETC","IJ",
            "LB","LC","MSP","ON","P","PN","PU","SB","SP","VC","VE"));

    // Annotates one line of text and returns the words whose POS tags are not in "meaningless"
    public static List<String> annotating(String linea){
        List<String> words = new ArrayList<>();

        if(linea == null){
            return words;
        }

        String text = clean(linea);
        if(Util.isNull(text)){
            return words;
        }

        CoreDocument document = new CoreDocument(text);
        CoreNlp.pipeline.annotate(document);

        for (CoreLabel token:  document.tokens()) {
            String word = token.word();
            String pos = token.tag(); 
            if(meaningless.contains(pos)) {
                continue;
            }

            words.add(word);
        }

        return words;
    }

    private static String clean(String myString) {
        StringBuilder newString = new StringBuilder(myString.length());
        for (int offset = 0; offset < myString.length();)
        {
            int codePoint = myString.codePointAt(offset);
            offset += Character.charCount(codePoint);
            // Replace invisible control characters and unused code points
            switch (Character.getType(codePoint))
            {
                case Character.CONTROL:     // \p{Cc}
                case Character.FORMAT:      // \p{Cf}
                case Character.PRIVATE_USE: // \p{Co}
                case Character.SURROGATE:   // \p{Cs}
                case Character.UNASSIGNED:  // \p{Cn}
                case Character.SPACE_SEPARATOR: // \p{Zs}
                case Character.LINE_SEPARATOR: // \p{Zl}
                case Character.PARAGRAPH_SEPARATOR: // \p{Zp}
                    // Skip this code point
                    break;
                default:
                    newString.append(Character.toChars(codePoint));
            }
        }
        return newString.toString();
    }
}
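
For context, this is roughly how the class is called. The demo class and the sample sentence below are my own illustration, not part of the original question:

import java.util.List;

// Hypothetical usage sketch (not from the original question)
public class CoreNlpDemo {
    public static void main(String[] args) {
        // With a correctly loaded Chinese pipeline, the sentence comes back segmented into words,
        // minus the tokens whose POS tags are in the "meaningless" set
        List<String> words = CoreNlp.annotating("斯坦福大学位于加利福尼亚州。");
        for (String word : words) {
            System.out.println(word);
        }
    }
}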

Here is the loading log:

2018-03-13 16:22:54.178  INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP  : Searching for resource: StanfordCoreNLP.properties ... found.
2018-03-13 16:22:54.179  INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator tokenize
2018-03-13 16:22:54.194  INFO 1424 --- [io-10301-exec-5] e.s.nlp.pipeline.TokenizerAnnotator      : No tokenizer type provided. Defaulting to PTBTokenizer.
2018-03-13 16:22:54.280  INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator ssplit
2018-03-13 16:22:54.318  INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator pos
2018-03-13 16:22:55.241  INFO 1424 --- [io-10301-exec-5] e.s.nlp.tagger.maxent.MaxentTagger       : Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].
It looks like the Chinese models have not been loaded.

As a result, the default (English) models are used for segment, ssplit, and pos, which makes the Chinese processing fail.

Please advise.

Thanks.

1 Answer:

Answer (score: 0):

I solved the problem with the following change in the CoreNlp class:

private static StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP");
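
The single-argument constructor is given the name of the properties file to load, so the pipeline is now built from StanfordCoreNLP.properties with the Chinese settings instead of the built-in defaults. An equivalent approach, sketched below under the assumption that the file is on the classpath (the ChinesePipelineFactory class name is my own), is to read the file into a Properties object yourself and pass that to the constructor:

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Alternative sketch (hypothetical helper class): load StanfordCoreNLP.properties from the
// classpath explicitly and hand the Properties object to the StanfordCoreNLP constructor.
public class ChinesePipelineFactory {

    static StanfordCoreNLP build() {
        Properties props = new Properties();
        try (InputStream in = ChinesePipelineFactory.class.getClassLoader()
                .getResourceAsStream("StanfordCoreNLP.properties")) {
            if (in == null) {
                throw new IOException("StanfordCoreNLP.properties not found on the classpath");
            }
            props.load(new InputStreamReader(in, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return new StanfordCoreNLP(props);
    }
}

Either way, the important part is that the Chinese configuration, rather than the default one, reaches the StanfordCoreNLP constructor.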

The loading log now looks like this:

2018-03-15 12:50:40.821  INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP  : Searching for resource: StanfordCoreNLP.properties ... found.
2018-03-15 12:50:41.185  INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator tokenize
2018-03-15 12:50:52.337  INFO 1460 --- [io-10301-exec-7] e.s.nlp.ie.AbstractSequenceClassifier    : Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [10.7 sec].
2018-03-15 12:50:52.393  INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator ssplit
2018-03-15 12:50:52.419  INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP  : Adding annotator pos
2018-03-15 12:50:53.292  INFO 1460 --- [io-10301-exec-7] e.s.nlp.tagger.maxent.MaxentTagger       : Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [0.8 sec].
2018-03-15 12:50:53.362  INFO 1460 --- [io-10301-exec-7] e.s.nlp.wordseg.ChineseDictionary        : Loading Chinese dictionaries from 1 file:
2018-03-15 12:50:53.362  INFO 1460 --- [io-10301-exec-7] e.s.nlp.wordseg.ChineseDictionary        :   edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
2018-03-15 12:50:53.657  INFO 1460 --- [io-10301-exec-7] e.s.nlp.wordseg.ChineseDictionary        : Done. Unique words in ChineseDictionary is: 423200.
2018-03-15 12:50:53.797  INFO 1460 --- [io-10301-exec-7] edu.stanford.nlp.wordseg.CorpusChar      : Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
2018-03-15 12:50:53.806  INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.wordseg.AffixDictionary   : Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].