Question

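We are using Stanford NER to train our own (CRF) classifier on French newspaper text. We are running into a problem with punctuation: Stanford NER appears to replace some punctuation marks with other characters.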

In the example below, the ' in "aujourd'hui" is replaced by `, and the « and » enclosing Ave Maria are replaced by `` and ''.

Raw input text:

" Aujourd'hui ... « Ave Maria » et ..."

Stanford NER output:

word    | tag | begin-offset | end-offset
Aujourd | O   | 31           | 38
`       | O   | 38           | 39
hui     | O   | 39           | 42

``      | O   | 331          | 332
Ave     | O   | 333          | 336
Maria   | O   | 337          | 342
''      | O   | 343          | 344

We tried the following flags when creating the classifier:

-outputFormatOptions includePunctuationDependencies

-inputEncoding utf-8 

-outputEncoding utf-8

None of them made any difference.

Any help would be appreciated.

Answer 1

Here is an example command for tokenizing French text with the French tokenizer:

java -Xmx10g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-french.properties -file example-french-sentence-one.txt -outputFormat text

Note the tokenize property:

tokenize.language = fr

This tells the pipeline to use the French tokenizer.

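That should handle the Aujourd'hui case. Unfortunately, guillemets are hard-coded in the French lexer to be converted to ", and there is no option to change that behavior.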

If I get a chance, I will try to push a change to the French tokenizer that makes this behavior optional.

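If you have another way to tokenize the text before submitting it to Stanford CoreNLP, you can use the tokenize.whitespace option to feed already-tokenized text to the pipeline: just supply the tokens separated by whitespace. Otherwise, another option is to process your training data so that it matches the way Stanford CoreNLP tokenizes it.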
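As a rough illustration of the pre-tokenizing route, here is a minimal Python sketch that splits text into whitespace-separated tokens while leaving apostrophes and guillemets untouched. The function name pretokenize and the token pattern are my own assumptions for the example, not part of Stanford CoreNLP, and the pattern is much cruder than the real French tokenizer (for instance, it keeps l'homme as a single token).

```python
import re

# Hypothetical token pattern, NOT the CoreNLP French tokenizer:
#   1. an ellipsis "..." kept as one token
#   2. a word, optionally with internal apostrophes or hyphens
#      (so "Aujourd'hui" and "Ave-Maria" stay whole)
#   3. any other single non-space character (e.g. « » , .)
TOKEN_PATTERN = re.compile(r"\.\.\.|\w+(?:['’\-]\w+)*|[^\w\s]")


def pretokenize(text: str) -> str:
    """Return the text with one space between tokens, punctuation unchanged."""
    tokens = TOKEN_PATTERN.findall(text)
    return " ".join(tokens)


if __name__ == "__main__":
    print(pretokenize("Aujourd'hui, «Ave Maria»."))
    # prints: Aujourd'hui , « Ave Maria » .
```

The output of such a script is what you would then pass to the pipeline with tokenize.whitespace enabled, so that the original characters and not the tokenizer's substitutions end up in the tokens.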
