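I am trying to run the Stanford CoreNLP server to annotate text that contains non-ASCII characters. Sometimes nlp.annotate() returns a dict, and sometimes it returns a string.
For example: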
'''
From https://github.com/smilli/py-corenlp/blob/master/example.py
'''
from pycorenlp import StanfordCoreNLP
import pprint
import re

if __name__ == '__main__':
    nlp = StanfordCoreNLP('http://localhost:9000')

    text = u"tab with good effect, denies pain".encode('utf-8')
    print('type(text): {0}'.format(type(text)))
    output = nlp.annotate(text, properties={
        'annotators': 'tokenize,ssplit',
        'outputFormat': 'json'
    })
    #pp = pprint.PrettyPrinter(indent=4)
    #pp.pprint(output)
    print('type(output): {0}'.format(type(output)))

    text = u"tab with good effect\u0013\u0013, denies pain".encode('utf-8')
    print('\ntype(text): {0}'.format(type(text)))
    output = nlp.annotate(text, properties={
        'annotators': 'tokenize,ssplit',
        'outputFormat': 'json'
    })
    print('type(output): {0}'.format(type(output)))
Output:
type(text): <type 'str'>
type(output): <type 'dict'>
type(text): <type 'str'>
type(output): <type 'unicode'>
I noticed that whenever type(output) is <type 'unicode'>, the Stanford CoreNLP server logs this warning:
WARNING: Untokenizable: ‼ (U+13, decimal: 19)
Is there a way to make nlp.annotate() always return the same type of result?
The server was started with the following command:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000
I am using Stanford CoreNLP 3.6.0, pycorenlp 0.3.0 and Python 3.5 x64 on Windows 7 SP1 x64 Ultimate.
Answer 0 (score: 0)
Quick fix:
import json

# to place right after `output = nlp.annotate(text, properties={…})`
if type(output) is str or type(output) is unicode:
    output = json.loads(output, strict=False)
I used strict=False because otherwise Python's json.loads fails with ValueError: Invalid control character at: line 1 column 33 (char 33).
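
The same idea can be wrapped in a small helper so that callers always get a dict. This is only a sketch: `ensure_dict` is a hypothetical name, not part of pycorenlp, and the `raw` string below is a made-up stand-in for a server response containing the U+0013 control character. It avoids the `unicode` name, which exists only on Python 2, and it also shows why the default strict mode rejects such a response.

```python
import json


def ensure_dict(output):
    """Return `output` as a dict, parsing it first if it is a JSON string.

    Hypothetical helper; `output` is assumed to be whatever
    nlp.annotate() returned (a dict, or text holding JSON).
    """
    if isinstance(output, dict):
        return output
    if isinstance(output, bytes):
        output = output.decode('utf-8')
    # strict=False lets json.loads accept raw control characters
    # (such as U+0013) inside JSON strings.
    return json.loads(output, strict=False)


# Made-up stand-in for a response with raw control characters in a string.
raw = '{"word": "effect\x13\x13"}'

try:
    json.loads(raw)  # the default strict mode rejects raw control characters
    strict_ok = True
except ValueError:
    strict_ok = False

parsed = ensure_dict(raw)
```

With this in place, the call site would become `output = ensure_dict(nlp.annotate(text, properties={…}))`, and the type check no longer has to be repeated after every call.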