Question

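I am trying to use the Stanford CoreNLP server to annotate text that contains non-ASCII characters. Sometimes nlp.annotate() returns a dictionary, and sometimes it returns a string.

For example: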

'''
From https://github.com/smilli/py-corenlp/blob/master/example.py
'''
from pycorenlp import StanfordCoreNLP
import pprint
import re

if __name__ == '__main__':
    nlp = StanfordCoreNLP('http://localhost:9000')
    text = u"tab with good effect, denies pain".encode('utf-8')
    print('type(text): {0}'.format(type(text)))

    output = nlp.annotate(text, properties={
        'annotators': 'tokenize,ssplit',
        'outputFormat': 'json'
    })
    #pp = pprint.PrettyPrinter(indent=4)
    #pp.pprint(output)
    print('type(output): {0}'.format(type(output)))

    text = u"tab with good effect\u0013\u0013, denies pain".encode('utf-8')
    print('\ntype(text): {0}'.format(type(text)))
    output = nlp.annotate(text, properties={
        'annotators': 'tokenize,ssplit',
        'outputFormat': 'json'
    })
    print('type(output): {0}'.format(type(output)))

Output:

type(text): <type 'str'>
type(output): <type 'dict'>

type(text): <type 'str'>
type(output): <type 'unicode'>

I noticed that whenever type(output) is <type 'unicode'>, the Stanford CoreNLP server logs this warning:

WARNING: Untokenizable: ‼ (U+13, decimal: 19)

Is there a way to make nlp.annotate() always return the same type of result?

The Stanford CoreNLP server was started with the following command:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000

I am using Stanford CoreNLP 3.6.0, pycorenlp 0.3.0 and Python 3.5 x64 on Windows 7 SP1 x64 Ultimate.

Answer 1

Quick fix:

import json
# Place this right after `output = nlp.annotate(text, properties={…})`.
# Checking for "not a dict" avoids the name `unicode`, which only exists on Python 2.
if not isinstance(output, dict):
    output = json.loads(output, strict=False)

I used strict=False because json.loads otherwise fails with ValueError: Invalid control character at: line 1 column 33 (char 33).
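
To illustrate the point about strict=False, here is a minimal sketch that runs without a CoreNLP server. The helper name `ensure_dict` and the JSON payload are made up for this example (they are not part of pycorenlp, and the payload is not real server output). It only shows that a raw control character such as U+0013 inside a JSON string makes the default parser raise ValueError, while strict=False accepts it.

```python
import json


def ensure_dict(output):
    """Return `output` as a dict, parsing it leniently if it is still JSON text.

    `ensure_dict` is a hypothetical helper name, not part of pycorenlp.
    """
    if isinstance(output, dict):
        return output  # already parsed, nothing to do
    if isinstance(output, bytes):
        output = output.decode('utf-8')
    # strict=False allows raw control characters (such as U+0013) inside strings.
    return json.loads(output, strict=False)


# Made-up payload for illustration; '\u0013' becomes a raw control character here.
raw = u'{"word": "effect\u0013\u0013"}'

try:
    json.loads(raw)  # default strict parsing rejects the control characters
    strict_ok = True
except ValueError:
    strict_ok = False

parsed = ensure_dict(raw)
print('strict parse succeeded: {0}'.format(strict_ok))
print('type(parsed): {0}'.format(type(parsed)))
```

With a helper like this, every call site can wrap the result, for example `output = ensure_dict(nlp.annotate(text, properties={…}))`, so downstream code always sees a dict, assuming the returned string is JSON apart from the control characters.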
