Question

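I am using Stanford CoreNLP to do POS tagging and NER on pre-tokenized Chinese text. I read the official documentation at https://stanfordnlp.github.io/CoreNLP/tokenize.html, which says that the tokenize.whitespace option, 'if set to true, separates words only when whitespace is encountered'. This is exactly what I want.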

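However, I use Python with pycorenlp to interact with the CoreNLP server and know nothing about Java. I then read the answer to "How to NER and POS tag a pre-tokenized text with Stanford CoreNLP?" and thought the only thing I had to do was add 'tokenize.whitespace' = 'true' and one other property to the properties dictionary of my POST request, but it does not work at all. I run my server like this: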

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 150000

In my Jupyter notebook:

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

output = nlp.annotate('公司 作为 物联网 行业', properties={
    'annotators': 'pos,ner',
    'tokenize.whitespace': 'true', # first property
    'ssplit.eolonly': 'true', # second property
    'outputFormat': 'json'
})

for sentence in output['sentences']:
    print(' '.join([token['word'] for token in sentence['tokens']]))

which gives:

公司 作为 物 联网 行业

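CoreNLP still splits the token 物联网, just as if I had not added the two properties. I then tried creating a .properties file and passing it on the command line in place of StanfordCoreNLP-chinese.properties, but that does not work either. In my test.properties: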

tokenize.whitespace=true
ssplit.eolonly=true

Then I run the server like this:

  java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties 'test.properties' -port 9000 -timeout 150000

It still behaves as if I had changed nothing. Does anyone know what I am missing? Any help is appreciated :)

Answer 1

In the end I solved my own problem.

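Using tokenize.whitespace = true with Chinese text is tricky; it never seems to take effect. Instead, add

'tokenize.language': 'Whitespace'

to your properties dictionary, or equivalently add

tokenize.language: Whitespace

to your .properties file, and that does the job.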
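
Combining this fix with the request from the question, a minimal sketch might look like the following. The helper names (whitespace_properties, words_per_sentence, annotate_pretokenized) are hypothetical and only for illustration; the sketch assumes pycorenlp is installed and that a CoreNLP server is listening on http://localhost:9000 as in the question.

```python
def whitespace_properties(annotators="pos,ner"):
    """Build request properties asking CoreNLP to split tokens on whitespace only."""
    return {
        "annotators": annotators,
        "tokenize.language": "Whitespace",  # the fix described in this answer
        "ssplit.eolonly": "true",           # one sentence per line, as in the question
        "outputFormat": "json",
    }


def words_per_sentence(output):
    """Collect the token strings of each sentence from CoreNLP's JSON output."""
    return [
        [token["word"] for token in sentence["tokens"]]
        for sentence in output["sentences"]
    ]


def annotate_pretokenized(text, url="http://localhost:9000"):
    """Send pre-tokenized text to a running CoreNLP server (URL is an assumption)."""
    # Imported here so the helpers above stay usable without pycorenlp installed.
    from pycorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP(url)
    return nlp.annotate(text, properties=whitespace_properties())
```

With the server running, calling words_per_sentence(annotate_pretokenized('公司 作为 物联网 行业')) should, going by this answer, keep 物联网 as a single token.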

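This property is documented on the same page, https://stanfordnlp.github.io/CoreNLP/tokenize.html#options, but I had not noticed it before. It is a bit confusing that two properties exist for the same purpose.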

Stanford CoreNLP tokenize.whitespace property does not work for Chinese
