I have been using Stanford CoreNLP for Chinese processing.
After upgrading to the latest version, 3.9.1, the Chinese segmenter (and ssplit, pos) stopped working.
Here is my StanfordCoreNLP.properties file (located under the resources folder):
# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = tokenize, ssplit, pos
# segment
tokenize.language = zh
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true
# sentence split
ssplit.boundaryTokenRegex = [.\u3002]|[!?\uFF01\uFF1F]+
# pos
pos.model = edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger
# ner
ner.language = chinese
ner.model = edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz
ner.applyNumericClassifiers = true
ner.useSUTime = false
# regexner
ner.fine.regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
ner.fine.regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE
# parse
parse.model = edu/stanford/nlp/models/srparser/chineseSR.ser.gz
# depparse
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_Chinese.gz
depparse.language = chinese
# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.algorithm = hybrid
coref.path.word2vec =
coref.language = zh
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
coref.print.md.log = false
coref.md.type = RULE
coref.md.liberalChineseMD = false
# kbp
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.tokensregex = edu/stanford/nlp/models/kbp/chinese/tokensregex
kbp.language = zh
kbp.model = none
# entitylink
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz
Here is the code that uses Stanford CoreNLP:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class CoreNlp {

    private static StanfordCoreNLP pipeline = new StanfordCoreNLP();

    // Chinese Treebank POS tags for function words and punctuation that should be dropped
    private static HashSet<String> meaningless = new HashSet<>(Arrays.asList(
            "AD", "AS", "BA", "CC", "CS", "DEC", "DEG", "DER", "DEV", "DT", "ETC", "IJ",
            "LB", "LC", "MSP", "ON", "P", "PN", "PU", "SB", "SP", "VC", "VE"));

    public static List<String> annotating(String linea) {
        List<String> words = new ArrayList<>();
        if (linea == null) {
            return words;
        }
        String text = clean(linea);
        if (Util.isNull(text)) {  // Util is a project-specific null/blank check helper
            return words;
        }
        CoreDocument document = new CoreDocument(text);
        CoreNlp.pipeline.annotate(document);
        for (CoreLabel token : document.tokens()) {
            String word = token.word();
            String pos = token.tag();
            if (meaningless.contains(pos)) {
                continue;
            }
            words.add(word);
        }
        return words;
    }

    // Strips invisible control characters, separators and unused code points
    private static String clean(String myString) {
        StringBuilder newString = new StringBuilder(myString.length());
        for (int offset = 0; offset < myString.length(); ) {
            int codePoint = myString.codePointAt(offset);
            offset += Character.charCount(codePoint);
            switch (Character.getType(codePoint)) {
                case Character.CONTROL:             // \p{Cc}
                case Character.FORMAT:              // \p{Cf}
                case Character.PRIVATE_USE:         // \p{Co}
                case Character.SURROGATE:           // \p{Cs}
                case Character.UNASSIGNED:          // \p{Cn}
                case Character.SPACE_SEPARATOR:     // \p{Zs}
                case Character.LINE_SEPARATOR:      // \p{Zl}
                case Character.PARAGRAPH_SEPARATOR: // \p{Zp}
                    break;                          // skip this code point
                default:
                    newString.append(Character.toChars(codePoint));
            }
        }
        return newString.toString();
    }
}
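For reference, a minimal usage sketch of the class above; the sample sentence and the expected result are my own illustration and assume the Chinese models are actually in effect:

// Hypothetical usage of CoreNlp.annotating()
List<String> kept = CoreNlp.annotating("我爱北京天安门。");
// With the Chinese segmenter and tagger active, this should print
// roughly [爱, 北京, 天安门]: 我 (PN) and 。 (PU) are dropped by the filter.
System.out.println(kept);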
Here is the loading log:
2018-03-13 16:22:54.178 INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP : Searching for resource: StanfordCoreNLP.properties ... found.
2018-03-13 16:22:54.179 INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP : Adding annotator tokenize
2018-03-13 16:22:54.194 INFO 1424 --- [io-10301-exec-5] e.s.nlp.pipeline.TokenizerAnnotator : No tokenizer type provided. Defaulting to PTBTokenizer.
2018-03-13 16:22:54.280 INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP : Adding annotator ssplit
2018-03-13 16:22:54.318 INFO 1424 --- [io-10301-exec-5] e.stanford.nlp.pipeline.StanfordCoreNLP : Adding annotator pos
2018-03-13 16:22:55.241 INFO 1424 --- [io-10301-exec-5] e.s.nlp.tagger.maxent.MaxentTagger : Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].
It looks like the Chinese models are never loaded.
As a result, the default (English) models are used for segment, ssplit and pos, which makes the Chinese processing fail.
Please advise.
Thanks.
Answer 0 (score: 0)
I solved this with the following change in the CoreNlp class:
private static StanfordCoreNLP pipeline = new StanfordCoreNLP("StanfordCoreNLP");
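Passing the file name prefix tells CoreNLP to load StanfordCoreNLP.properties from the classpath instead of falling back to its built-in (English) defaults. If you prefer to be explicit about which file is loaded, a minimal alternative sketch is to read the properties yourself and hand them to the constructor (the PipelineFactory class and fromClasspath method below are just illustrative names):

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PipelineFactory {
    // Builds a pipeline from a properties file on the classpath,
    // e.g. fromClasspath("StanfordCoreNLP.properties").
    static StanfordCoreNLP fromClasspath(String resourceName) throws IOException {
        Properties props = new Properties();
        try (InputStream in = PipelineFactory.class.getClassLoader()
                .getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("Resource not found: " + resourceName);
            }
            props.load(in);
        }
        return new StanfordCoreNLP(props);  // constructor that takes java.util.Properties
    }
}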
The loading log now looks like this:
2018-03-15 12:50:40.821 INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP : Searching for resource: StanfordCoreNLP.properties ... found.
2018-03-15 12:50:41.185 INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP : Adding annotator tokenize
2018-03-15 12:50:52.337 INFO 1460 --- [io-10301-exec-7] e.s.nlp.ie.AbstractSequenceClassifier : Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... done [10.7 sec].
2018-03-15 12:50:52.393 INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP : Adding annotator ssplit
2018-03-15 12:50:52.419 INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.pipeline.StanfordCoreNLP : Adding annotator pos
2018-03-15 12:50:53.292 INFO 1460 --- [io-10301-exec-7] e.s.nlp.tagger.maxent.MaxentTagger : Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [0.8 sec].
2018-03-15 12:50:53.362 INFO 1460 --- [io-10301-exec-7] e.s.nlp.wordseg.ChineseDictionary : Loading Chinese dictionaries from 1 file:
2018-03-15 12:50:53.362 INFO 1460 --- [io-10301-exec-7] e.s.nlp.wordseg.ChineseDictionary : edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
2018-03-15 12:50:53.657 INFO 1460 --- [io-10301-exec-7] e.s.nlp.wordseg.ChineseDictionary : Done. Unique words in ChineseDictionary is: 423200.
2018-03-15 12:50:53.797 INFO 1460 --- [io-10301-exec-7] edu.stanford.nlp.wordseg.CorpusChar : Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
2018-03-15 12:50:53.806 INFO 1460 --- [io-10301-exec-7] e.stanford.nlp.wordseg.AffixDictionary : Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].