I have a review text:
" The tutu's for my heart... she loves it! It fits well, fits her for a while with the elastic waist.... The quality is good, very cheap! I would easily buy her another one."
and I send it to the CoreNLP server:
import unicodedata

properties = {
    "tokenize.whitespace": "true",
    "annotators": "tokenize, ssplit, pos, lemma, ner, parse",
    "outputFormat": "json"
}
# Normalize and strip non-ASCII characters before sending the text to the server
if not isinstance(paragraph, str):
    paragraph = unicodedata.normalize('NFKD', paragraph).encode('ascii', 'ignore')
result = self.nlp.annotate(paragraph, properties=properties)
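For reference, self.nlp itself is not shown in the question; a minimal sketch of how such a client is typically set up, assuming the pycorenlp package and a CoreNLP server already running on localhost:9000 (both assumptions, not part of the original code):

from pycorenlp import StanfordCoreNLP

# Hypothetical setup: point a pycorenlp client at a locally running CoreNLP server.
nlp = StanfordCoreNLP('http://localhost:9000')
# annotate() returns the parsed response as a dict when outputFormat is "json".
result = nlp.annotate(paragraph, properties=properties)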
which gives me this result:
{
u'sentences':[
{
u'parse':u'SENTENCE_SKIPPED_OR_UNPARSABLE',
u'index':0,
u'tokens':[
{
u'index':1,
u'word':u'The',
u'lemma':u'the',
u'pos':u'DT',
u'characterOffsetEnd':3,
u'characterOffsetBegin':0,
u'originalText':u'The'
},
{
u'index':2,
u'word':u"tutu's",
u'lemma':u"tutu'",
u'pos':u'NNS',
u'characterOffsetEnd':10,
u'characterOffsetBegin':4,
u'originalText':u"tutu's"
},
// ...
{
u'index':34,
u'word':u'easily.',
u'lemma':u'easily.',
u'pos':u'NN',
u'characterOffsetEnd':187,
u'characterOffsetBegin':180,
u'originalText':u'easily.'
}
]
}
]
}
I notice that the sentences are not being split - any idea what the problem could be?
Answer 0 (score: 1)
I don't know why, but the problem seems to come from tokenize.whitespace. I just commented it out:
properties = {
    # "tokenize.whitespace": "true",
    "annotators": "tokenize, ssplit, pos, lemma, ner, parse",
    "outputFormat": "json"
}
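Most likely this happens because with tokenize.whitespace set to true, tokens are produced exactly as they appear between whitespace, so sentence-final punctuation stays glued to the preceding word (note "easily." coming back as a single token above), and the sentence splitter never sees the standalone ".", "!" or "?" tokens it uses as boundaries. With the default tokenizer the punctuation is split off and ssplit works again. A quick check, assuming the same pycorenlp-style client as in the sketch above:

properties = {
    "annotators": "tokenize, ssplit, pos, lemma, ner, parse",
    "outputFormat": "json"
}
result = nlp.annotate(paragraph, properties=properties)

# With the default tokenizer, punctuation becomes separate tokens, so ssplit
# can find sentence boundaries and several sentences should come back.
print(len(result['sentences']))
for sentence in result['sentences']:
    print(' '.join(token['word'] for token in sentence['tokens']))

If whitespace tokenization is really needed, one alternative (from CoreNLP's ssplit options) is to put each sentence on its own line and set "ssplit.eolonly": "true".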