Question

I am creating a custom NER model with CoreNLP 3.6.0.

My properties file is:

# location of the training file 
trainFile = /home/damiano/stanford-ner.tsv 
# location where you would like to save (serialize) your 
# classifier; adding .gz at the end automatically gzips the file, 
# making it smaller, and faster to load 
serializeTo = ner-model.ser.gz

# structure of your training file; this tells the classifier that 
# the word is in column 0 and the correct answer is in column 1 
map = word=0,answer=1

# This specifies the order of the CRF: order 1 means that features 
# apply at most to a class pair of previous class and current class 
# or current class and next class. 
maxLeft=1

# these are the features we'd like to train with 
# some are discussed below, the rest can be 
# understood by looking at NERFeatureFactory 
useClassFeature=true 
useWord=true 
# word character ngrams will be included up to length 6 as prefixes 
# and suffixes only  
useNGrams=true 
noMidNGrams=true 
maxNGramLeng=6 
usePrev=true 
useNext=true 
useDisjunctive=true 
useSequences=true 
usePrevSequences=true 
# the last 4 properties deal with word shape features 
useTypeSeqs=true 
useTypeSeqs2=true 
useTypeySequences=true 
wordShape=chris2useLC

I build the model with this command:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier  -prop /home/damiano/stanford-ner.prop

The problem appears when I use this model to extract the entities from a text file. The command is:

java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt

The content of file.txt is:

Hello!
my
name
is
John.

The output is:

Hello/O !/O my/O name/O is/O John/PERSON ./O

As you can see, it splits "Hello!" into two tokens. The same happens with "John."

I must use a whitespace tokenizer.

How can I set it?

Why does CoreNLP split these words into two tokens?

Answer 1

You set your own tokenizer by specifying its class name with the tokenizerFactory flag/property:

tokenizerFactory = edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory

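You can specify any class that implements the Tokenizer<T> interface, but the included WhitespaceTokenizer sounds like what you want. If the tokenizer has options, you can specify them with tokenizerOptions. For instance, here, if you also specify: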

tokenizerOptions = tokenizeNLs=true

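then the newlines in your input will be preserved as tokens (for output options that do not always convert things into a one-token-per-line format).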
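To make the effect concrete, here is a plain-Java sketch of what whitespace-only tokenization does to the file.txt from the question. It does not call CoreNLP: the class name, the `keepNewlines` parameter, and the `<NL>` marker are assumptions made up for this illustration and are not CoreNLP's own API or output format.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceSplitSketch {
    // Hypothetical marker for a preserved line break; chosen for this sketch only.
    static final String NEWLINE_MARKER = "<NL>";

    // Splits on whitespace only, so punctuation stays attached to its word.
    // When keepNewlines is true, each line break becomes its own marker token.
    static List<String> tokenize(String text, boolean keepNewlines) {
        List<String> tokens = new ArrayList<>();
        String[] lines = text.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            for (String piece : lines[i].trim().split("\\s+")) {
                if (!piece.isEmpty()) {
                    tokens.add(piece);
                }
            }
            if (keepNewlines && i < lines.length - 1) {
                tokens.add(NEWLINE_MARKER);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello!\nmy\nname\nis\nJohn.";
        System.out.println(tokenize(text, false));
        System.out.println(tokenize(text, true));
    }
}
```

With newlines dropped, "Hello!" and "John." each remain a single token, which is the behaviour the question asks for.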

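Note: options like tokenize.whitespace=true apply at the level of CoreNLP. They are not interpreted if provided to individual components such as CRFClassifier (you get a warning saying that the option is ignored).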

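As Nikita Astrakhantsev notes, this is not necessarily a good thing to do. Doing it at test time is only correct if your training data is also whitespace-separated; otherwise it will adversely affect performance. Tokens like the ones you get from whitespace separation are also bad for subsequent NLP processing such as parsing.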

Answer 2

Update: if you want to use the whitespace tokenizer here, ~~simply add tokenize.whitespace=true to your properties file.~~ see Christopher Manning's answer.

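However, to answer your second question, "why does CoreNLP split these words into two tokens?": I would suggest keeping the default tokenizer (which is PTBTokenizer), because it simply gives better results. Usually the reason to switch to whitespace tokenization is a high demand for processing speed or (usually: and) a low demand for tokenization quality. Since you are going to use the tokens for NER afterwards, I doubt that this is your case.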

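Even in your example, if you end up with the token John. after tokenization, it cannot be matched by a gazette or by training examples. More details, and the reasons why tokenization is not that simple, can be found here.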
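The gazette point can be illustrated with a small plain-Java sketch. It assumes a one-entry gazette containing "John" and uses a simple regular expression as a rough stand-in for a punctuation-aware tokenizer; it is not PTBTokenizer and does not call CoreNLP, and all class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GazetteLookupSketch {
    // A run of letters/digits, or a single non-space symbol such as '!' or '.'.
    private static final Pattern WORD_OR_PUNCT =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    // Whitespace-only tokenization: punctuation stays glued to the word.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Rough stand-in for a punctuation-aware tokenizer (NOT PTBTokenizer).
    static List<String> punctuationAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD_OR_PUNCT.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Exact-match lookup of each token in the gazette.
    static List<String> gazetteHits(List<String> tokens, Set<String> gazette) {
        List<String> hits = new ArrayList<>();
        for (String token : tokens) {
            if (gazette.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> gazette = new HashSet<>(Arrays.asList("John")); // assumed gazette
        String text = "Hello! my name is John.";
        System.out.println(gazetteHits(whitespaceTokens(text), gazette));
        System.out.println(gazetteHits(punctuationAwareTokens(text), gazette));
    }
}
```

In this sketch the whitespace tokens contain "John." rather than "John", so the exact-match lookup finds nothing, while the punctuation-aware split yields a separate "John" token that matches.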

How to set whitespace tokenizer on NER model?
