Question

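I am using a custom NER model (CRF-based) for NER tagging. The problem is that whenever the test data contains multiple entities separated by punctuation or a stop word, the model tags the whole sequence as a single entity.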

For example:
For "India, china" it produces
(u'India', u'B-LOC'), (u',', u'I-LOC'), (u'china', u'I-LOC')
and for "india and australia" it produces
(u'india', u'B-LOC'), (u'and', u'I-LOC'), (u'australia', u'I-LOC')
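
To make the symptom concrete, here is a minimal sketch of how IOB2 tags are usually grouped into entity spans. The helper name `iob2_to_spans` is made up for illustration and is not part of any NER library. It shows why a separator tagged `I-LOC` glues the two locations into one span, while a separator tagged `O` followed by `B-LOC` would split them.

```python
def iob2_to_spans(tagged):
    """Group (token, tag) pairs into (entity_type, [tokens]) spans under IOB2.

    Hypothetical helper for illustration only.
    """
    spans = []
    current = None  # the span currently being built, or None
    for token, tag in tagged:
        prefix, _, etype = tag.partition("-")
        # An I- tag of the same type extends the open span.
        if prefix == "I" and current is not None and current[0] == etype:
            current[1].append(token)
            continue
        # Anything else closes the open span, if there is one.
        if current is not None:
            spans.append(current)
            current = None
        # B- (or a stray I-) opens a new span; O opens nothing.
        if prefix in ("B", "I"):
            current = (etype, [token])
    if current is not None:
        spans.append(current)
    return spans


# Output reported in the question: the comma is tagged I-LOC.
observed = [("India", "B-LOC"), (",", "I-LOC"), ("china", "I-LOC")]
# Tagging the asker presumably wants: the comma is O, "china" starts a new entity.
wanted = [("India", "B-LOC"), (",", "O"), ("china", "B-LOC")]

print(iob2_to_spans(observed))
print(iob2_to_spans(wanted))
```

With the observed tags the decoder yields a single LOC span covering all three tokens; with the wanted tags it yields two separate LOC spans.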

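I have not removed any stop words or punctuation from the training dataset, and they are tagged as "O". So why are the punctuation marks and stop words that occur between two entities being tagged as part of a single entity?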

Here are the properties file and the dataset I used to train the model:

Properties file (ner.prop)

 trainFile=Clean_Data.tsv
 serializeTO=ner-model_cleanGazette_full.ser.gz
 map = word=0,answer=1,tag=2
 useClassFeature=true
 useWord=true
 useNGrams=true
 noMidNGrams=true
 qnSize=10
 entitySubclassification=IOB2
 retainEntitySubclassification=true
 maxNGramLeng=6
 usePrev=true
 useNext=true
 useSequences=true
 usePrevSequences=true
 useTypeSeqs=true
 useTypeSeqs2=true
 useTypeySequences=true
 wordShape=chris2useLC
 useDisjunctive=true
 useGazettes=true
 gazette=gazetta.txt
 sloppyGazette=true

Kaggle dataset used (Clean_Data.tsv)

**Word    ner     pos**
Thousands   O   NNS
of  O   IN
demonstrators   O   NNS
have    O   VBP
marched O   VBN
through O   IN
London  B-LOC   NNP
to  O   TO
protest O   VB
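
One thing that may be worth checking is whether separators such as "," and "and" really are tagged "O" everywhere in the training file, since a multi-word name containing one of them could carry an I- tag. The sketch below counts the labels seen for a few tokens of interest. It assumes the column order from the `map` line above (word in column 0, label in column 1) and whitespace-separated columns; the function name `label_counts` and the sample rows are made up for illustration and are not taken from the real dataset.

```python
from collections import Counter, defaultdict


def label_counts(lines, tokens):
    """Count which labels each token of interest receives.

    Assumes column 0 is the word and column 1 is the NER label.
    Hypothetical helper for illustration only.
    """
    wanted = {t.lower() for t in tokens}
    counts = defaultdict(Counter)
    for line in lines:
        cols = line.split()
        if len(cols) < 2:
            continue  # skip blank lines and sentence separators
        word, label = cols[0].lower(), cols[1]
        if word in wanted:
            counts[word][label] += 1
    return counts


# Made-up rows in the same layout as Clean_Data.tsv.
sample = [
    "London\tB-LOC\tNNP",
    ",\tO\t,",
    "and\tO\tCC",
    "and\tI-LOC\tCC",
]

for word, labels in label_counts(sample, [",", "and"]).items():
    print(word, dict(labels))
```

To run it on the real file, pass an open file handle instead of `sample`, for example `label_counts(open("Clean_Data.tsv", encoding="utf-8"), [",", "and"])`. Any label other than "O" for these tokens would be a lead to follow up, though it would not by itself prove the cause.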

What else could I add or remove to fix this problem?

Why am I getting multiple entities tagged as a single entity?
