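I am building a custom model using only the 'stanford-ner' package, so that input is tagged with only a provided set of tags. So far I can successfully train a model whose entities contain a space, but my input gets tokenized, so a multi-word entity is not recognized as a single term.
My entity data is as follows: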
-DOCSTART- 0
stACkOverflOW BRAND
questions CATCHWORD
top votes CATEGORY
downvoted OFFENSIVE
My properties file is:
trainFile = training_files/tags_data.tsv
serializeTo = ner-model-stackoverflow.ser.gz
wordFunction = edu.stanford.nlp.process.LowercaseFunction
map = word=0,answer=1
Model training and server details:
java -cp "stanford-ner-2018-02-27/stanford-ner.jar:stanford-ner-2018-02-27/lib/*" -mx2g edu.stanford.nlp.ie.crf.CRFClassifier -prop training_files/prop.txt
cp stanford-ner-2018-02-27/stanford-ner.jar stanford-ner-with-classifier.jar
jar -uf stanford-ner-with-classifier.jar ner-model-stackoverflow.ser.gz
java -mx100m -cp stanford-ner-with-classifier.jar edu.stanford.nlp.ie.NERServer -port 9191 -loadClassifier ner-model-stackoverflow.ser.gz &
Testing:
telnet localhost 9191
stackoverflow | helps you //with questions, *helpful answers have top votes and less downvoted
The output is as follows:
stackoverflow/BRAND |/O helps/O you/O //O //O with/O questions/CATCHWORD ,/O */O helpful/O answers/O have/O top/O votes/O and/O less/O downvoted/OFFENSIVE
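To make the output above easier to check, here is a minimal sketch that parses a tagged line and merges adjacent tokens sharing the same non-O tag into one entity. It is in Python purely for illustration (the same logic should port to PHP), the function names `parse_tagged` and `group_entities` are hypothetical, and it assumes the server always returns space-separated `token/TAG` pairs as shown.

```python
def parse_tagged(line):
    """Split a 'token/TAG token/TAG ...' line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST slash so a token that is itself "/" still parses.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def group_entities(pairs, outside="O"):
    """Merge runs of adjacent tokens with the same non-outside tag."""
    entities = []
    tokens, current = [], None
    for token, tag in pairs:
        if tag == current and tag != outside:
            tokens.append(token)  # extend the running entity
            continue
        if current not in (None, outside):
            entities.append((" ".join(tokens), current))
        tokens, current = [token], tag
    if current not in (None, outside):
        entities.append((" ".join(tokens), current))
    return entities


observed = ("stackoverflow/BRAND |/O helps/O you/O //O //O with/O "
            "questions/CATCHWORD ,/O */O helpful/O answers/O have/O "
            "top/O votes/O and/O less/O downvoted/OFFENSIVE")
print(group_entities(parse_tagged(observed)))
```

On the observed output this yields no CATEGORY entity at all, so the model is not tagging "top" or "votes" individually either. If both tokens did come back as CATEGORY, the grouping step would report them as the single entity "top votes".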
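How should I process the input to make sure "top votes" is treated as a single entity? How can I verify whether my model actually learned "top votes" as an entity, or whether that line was skipped because of a tab issue? Do I need any other package?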
I went through 'Stanford NLP named entities of more than one token', but that uses Java. I am using a PHP socket connection to connect to the NER Server and get the response.
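One thing that may be worth checking on the training side: assuming CRFClassifier reads a `map = word=0,answer=1` file as one token per line with tab-separated columns, a phrase such as "top votes" would need one row per token, each carrying the same label. The sketch below (Python for illustration only, with the hypothetical helper `expand_to_token_rows`) rewrites phrase-level entries into that shape. It is an assumption about the cause, not a confirmed fix.

```python
def expand_to_token_rows(rows):
    """Turn (phrase, label) pairs into one 'token<TAB>label' row per token."""
    out = []
    for phrase, label in rows:
        # Whitespace split only; the server's own tokenizer may split differently.
        for token in phrase.split():
            out.append(f"{token}\t{label}")
    return out


entries = [
    ("stACkOverflOW", "BRAND"),
    ("questions", "CATCHWORD"),
    ("top votes", "CATEGORY"),
    ("downvoted", "OFFENSIVE"),
]
for row in expand_to_token_rows(entries):
    print(row)
```

With rows in this form, the tagged server response can be regrouped into multi-token entities on the client side by merging adjacent tokens that share a label.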