Question

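I am trying to use CoreNLP for segmentation, POS tagging, and NER on Chinese text. I am trying to use the official StanfordCoreNLP python package on Windows 10 with Python 3.6.

I can't figure out how to tell CoreNLP to work with Chinese. I have downloaded stanford-corenlp-full-2018-02-27.zip and the stanford-chinese-corenlp-2018-02-27-model.jar file from the official CoreNLP website. Part of the problem is that there seem to be hundreds of Python wrappers for Stanford's CoreNLP, including py-corenlp, stanfordcorenlp, nltk, etc., which makes it hard to find out what I need to do for any specific wrapper. I currently have English working with the stanfordcorenlp package. I suspect the solution is to pass the language, or the path to the Chinese .jar, into the segmenter.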

English code (from the official website)

py-corenlp

Trying Chinese sentences produces storage errors for the POS or NER tags, and (I believe) encoding errors during tokenization.

Answer 1

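corenlp-python acts as a client to the CoreNLP Server. As a convenience, calling the client starts the default server, which is set up for English NLP tasks.

You can start the server yourself and configure it for Chinese: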

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 15000
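
If you would rather launch that server from a script than from a terminal, the same command can be assembled as an argument list. This is only a sketch: build_server_command and start_server are hypothetical helper names, not part of any CoreNLP wrapper, and it assumes java is on your PATH and that the Chinese models jar sits in the same directory as the other CoreNLP jars, so that the properties file can be found on the classpath.

```python
import subprocess


def build_server_command(properties="StanfordCoreNLP-chinese.properties",
                         port=9000, timeout_ms=15000, heap="4g"):
    """Return the java argument list equivalent to the command line above."""
    return [
        "java", "-Xmx" + heap, "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-serverProperties", properties,
        "-port", str(port),
        "-timeout", str(timeout_ms),
    ]


def start_server(corenlp_dir, **options):
    """Launch the server as a child process (hypothetical helper)."""
    # cwd matters: the "*" classpath is resolved against the directory
    # that holds the CoreNLP jars and the Chinese models jar.
    return subprocess.Popen(build_server_command(**options), cwd=corenlp_dir)
```

Passing the arguments as a list avoids shell quoting differences between Windows and other systems; Java expands the "*" classpath entry itself.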

When using the Python client, tell it not to start the default server:

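import corenlp

text = "我爱北京天安门。"  # example input; substitute your own Chinese text
# Connect to the already-running Chinese server instead of launching the default one.
with corenlp.CoreNLPClient(
        start_server=False,
        endpoint='http://localhost:9000',
        annotators="tokenize ssplit pos".split()) as client:
    ann = client.annotate(text)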
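
Since the question mentions being unsure which wrapper to use, it may help to see that a running server can also be queried with nothing but the standard library. The sketch below assumes the Chinese server is listening on http://localhost:9000; the function names (build_url, annotate, tagged_tokens) are made up for illustration. Sending the text as UTF-8 bytes may also help with the encoding errors described in the question, though that is an assumption rather than a confirmed fix.

```python
import json
import urllib.parse
import urllib.request


def build_url(endpoint, annotators):
    """Return the server URL with the annotators passed as a JSON 'properties' parameter."""
    props = {"annotators": ",".join(annotators), "outputFormat": "json"}
    return endpoint + "/?" + urllib.parse.urlencode({"properties": json.dumps(props)})


def annotate(text, endpoint="http://localhost:9000",
             annotators=("tokenize", "ssplit", "pos")):
    """POST UTF-8 encoded text to a running CoreNLP server and return the parsed JSON."""
    request = urllib.request.Request(build_url(endpoint, annotators),
                                     data=text.encode("utf-8"))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def tagged_tokens(result):
    """Flatten the JSON response into (word, pos, ner) triples; absent tags become None."""
    return [
        (tok["word"], tok.get("pos"), tok.get("ner"))
        for sentence in result.get("sentences", [])
        for tok in sentence["tokens"]
    ]
```

With the server running, tagged_tokens(annotate(text)) gives one tuple per segmented word. The ner slot stays None unless the ner annotator (and whatever annotators it depends on in your CoreNLP version) is requested.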

Using Stanford CoreNLP for Chinese in Python