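Is there any way to make pycorenlp's `nlp.annotate()` always return the same type of result?

Asked: 2016-09-21 00:42:06

Tags: python stanford-nlp

I am trying to run the Stanford CoreNLP server to tokenize text that contains non-ASCII characters. Sometimes nlp.annotate() returns a dictionary, and sometimes it returns a string.

For example: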

'''
From https://github.com/smilli/py-corenlp/blob/master/example.py
'''
from pycorenlp import StanfordCoreNLP
import pprint
import re

if __name__ == '__main__':
    nlp = StanfordCoreNLP('http://localhost:9000')
    text = u"tab with good effect, denies pain".encode('utf-8')
    print('type(text): {0}'.format(type(text)))

    output = nlp.annotate(text, properties={
        'annotators': 'tokenize,ssplit',
        'outputFormat': 'json'
    })
    #pp = pprint.PrettyPrinter(indent=4)
    #pp.pprint(output)
    print('type(output): {0}'.format(type(output)))

    text = u"tab with good effect\u0013\u0013, denies pain".encode('utf-8')
    print('\ntype(text): {0}'.format(type(text)))
    output = nlp.annotate(text, properties={
        'annotators': 'tokenize,ssplit',
        'outputFormat': 'json'
    })
    print('type(output): {0}'.format(type(output)))

Output:

type(text): <type 'str'>
type(output): <type 'dict'>

type(text): <type 'str'>
type(output): <type 'unicode'>

I noticed that whenever type(output) is <type 'unicode'>, the Stanford CoreNLP server logs this warning:

WARNING: Untokenizable: ‼ (U+13, decimal: 19)

Is there any way to make nlp.annotate() always return the same type of result?

The Stanford CoreNLP server was started with the following command:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000

I am using Stanford CoreNLP 3.6.0, pycorenlp 0.3.0 and python 3.5 x64 on Windows 7 SP1 x64 Ultimate.

1 answer:

Answer 0 (score: 0)

Quick fix:

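import json
# Place this right after `output = nlp.annotate(text, properties={…})`.
# Anything that is not already a dict is treated as undecoded JSON text.
# Checking against dict avoids the name `unicode`, which exists only on
# Python 2 and raises NameError on Python 3.
if not isinstance(output, dict):
    output = json.loads(output, strict=False)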

I used strict=False because, with the default strict parsing, json.loads fails with "ValueError: Invalid control character at: line 1 column 33 (char 33)".
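
To see the effect of strict=False without a running server, here is a minimal sketch. It assumes the server response is JSON text with a raw U+0013 character inside a string value; the `raw` sample and the helper name `ensure_dict` are made up for illustration and are not part of pycorenlp.

```python
import json


def ensure_dict(output):
    """Return `output` as a dict, decoding it leniently if it is still text.

    `ensure_dict` is a hypothetical helper name, not a pycorenlp function.
    """
    if isinstance(output, dict):
        return output
    # strict=False lets the json module accept raw control characters
    # (such as U+0013) inside JSON strings.
    return json.loads(output, strict=False)


# Made-up stand-in for a server response: the \u0013 escapes in this Python
# literal become raw control characters inside the JSON text.
raw = u'{"sentences": [{"tokens": [{"word": "effect\u0013\u0013"}]}]}'

try:
    json.loads(raw)  # strict=True is the default
    strict_ok = True
except ValueError:  # json.JSONDecodeError is a ValueError subclass on Python 3
    strict_ok = False

parsed = ensure_dict(raw)
print(strict_ok)                      # did the strict parse succeed?
print(type(parsed))                   # type after lenient decoding
print(ensure_dict(parsed) is parsed)  # dicts are passed through unchanged
```

Wrapping each call as `output = ensure_dict(nlp.annotate(...))` would then give a dict in both cases shown in the question, as long as the server really does return valid JSON apart from the control characters.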