Question

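I am using the Python API of Vowpal Wabbit to train named entity recognition classifiers that detect the names of people, organisations, and locations in short sentences. I have put together an IPython Notebook with details on the data, how the models are trained, and the entities identified in evaluation sentences. The training data comes from the ATIS and CoNLL 2003 datasets.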

The setup of my Vowpal Wabbit SearchTask class (based on this tutorial):

class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)

        sch.set_options( sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES )

    def _run(self, sentence):
        output = []
        for n in range(len(sentence)):
            pos,word = sentence[n]
            with self.vw.example({'w': [word]}) as ex:
                pred = self.sch.predict(examples=ex, my_tag=n+1, oracle=pos, condition=[(n,'p'), (n-1, 'q')])
                output.append(pred)
        return output

Model training:

vw = pyvw.vw(search=num_labels, search_task='hook', ring_size=1024)
#num_labels = 3 ('B'eginning entity, 'I'nside entity, 'O'ther)

sequenceLabeler = vw.init_search_task(SequenceLabeler)    
sequenceLabeler.learn(training_set)
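
For reference, `_run` unpacks each token as `pos, word`, so `training_set` is presumably a list of sentences, each a list of `(label_id, word)` pairs with label ids in 1..num_labels. A minimal sketch of building that structure from BIO-tagged tokens follows. The `LABEL_IDS` mapping (and its B=1, I=2, O=3 ordering) and the `to_training_sentence` helper are hypothetical names for illustration, not part of pyvw.

```python
# Assumed mapping from BIO tags to 1-based action ids; the actual
# ordering used in the notebook is not shown in the question.
LABEL_IDS = {"B": 1, "I": 2, "O": 3}


def to_training_sentence(tagged_tokens):
    """Convert [(word, tag), ...] into [(label_id, word), ...],
    the order in which _run unpacks each token."""
    return [(LABEL_IDS[tag], word) for word, tag in tagged_tokens]


tagged = [("new", "B"), ("york", "I"), ("to", "O"), ("las", "B"), ("vegas", "I")]
training_sentence = to_training_sentence(tagged)

# A training set is then a list of such sentences.
training_set_example = [training_sentence]
```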

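The models perform well on named entities that appear in the training data (exact string matches) but generalize poorly to new examples that use the same structure. That is, the classifiers identify entities present in sentences from the training data, but when I change only the names, they do poorly.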

sample_sentences = ['new york to las vegas on sunday afternoon', 
                    'chennai to mumbai on sunday afternoon',
                    'lima to ascuncion on sunday afternoon']

Output when running the classifier:

new york to las vegas on sunday afternoon
locations - ['new york', 'las vegas']

chennai to mumbai on sunday afternoon
locations - []

lima to ascuncion on sunday afternoon
locations - []

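This suggests that even though the sentence structure stays the same ("a to b on sunday afternoon"), the model cannot identify the new locations, perhaps because it has memorized the training examples?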
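
The entity lists in the output are presumably obtained by grouping consecutive B/I predictions into strings. A minimal sketch of that decoding step is below; `extract_entities` is a hypothetical helper, the label ids (1 = B, 2 = I, anything else = O) are an assumption, and the `predicted` list is hand-written for illustration rather than real model output.

```python
def extract_entities(words, label_ids, begin=1, inside=2):
    """Group words tagged B/I into entity strings.
    Assumes ids 1 = Beginning, 2 = Inside, anything else = Other."""
    entities, current = [], []
    for word, label in zip(words, label_ids):
        if label == begin:
            # A new entity starts; flush any entity in progress.
            if current:
                entities.append(" ".join(current))
            current = [word]
        elif label == inside and current:
            # Continue the entity in progress.
            current.append(word)
        else:
            # 'Other' (or a stray 'Inside' with no open entity) ends it.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


words = "new york to las vegas on sunday afternoon".split()
predicted = [1, 2, 3, 1, 2, 3, 3, 3]  # hand-written labels, not model output
locations = extract_entities(words, predicted)
```

If every token is predicted as 'O', as appears to happen for the unseen city names, the function returns an empty list.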

Similar results hold for the organisation and person classifiers. These can be found in my Github.

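My questions are:

1. What am I doing wrong here?
2. Are there other parameters of the model that I can change? Or can I make better use of existing ones such as ring_size and search_task?
3. What suggestions can you offer to improve the generalizability of the models?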

Answer 1

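1. You are using no gazetteers and no orthographic features (such as --spelling or --affix), and your data is all lowercased, so the only features that can help are unigram and bigram identities. It is no surprise that you overfit the training data. In theory, you could boost your training data with artificial named entities that follow the patterns (x to y on Sunday), but if that would help, it would be easier to build a rule-based classifier.
2. There are many parameters, e.g. -l (learning rate) and --passes. See the tutorial and the list of options. Note that ring_size does not influence prediction quality; you just need to set it high enough that you do not get any warnings (i.e. higher than the longest sequence).
3. See 1.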
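
As an illustration of the first point, here is a minimal sketch of computing affix, word-shape, and gazetteer features for a token by hand. The helper names (`word_shape`, `word_features`), the feature-string format, and the toy `CITIES` set are all hypothetical; VW's own --affix and --spelling options generate comparable features internally, and whether they behave as expected with a Python search-task hook should be checked against the documentation.

```python
def word_shape(word):
    """Collapse characters into classes (A=upper, a=lower, 0=digit, -=other),
    merging runs of the same class, e.g. 'Mumbai' -> 'Aa'."""
    shape = []
    for ch in word:
        if ch.isupper():
            cls = "A"
        elif ch.islower():
            cls = "a"
        elif ch.isdigit():
            cls = "0"
        else:
            cls = "-"
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def word_features(word, gazetteer, max_affix=3):
    """Return feature strings for one token: identity, shape,
    prefixes/suffixes up to max_affix characters, and a gazetteer flag."""
    feats = ["w=" + word, "shape=" + word_shape(word)]
    for k in range(1, min(max_affix, len(word)) + 1):
        feats.append("pre%d=%s" % (k, word[:k]))
        feats.append("suf%d=%s" % (k, word[-k:]))
    # Single-token lookup only; multi-word names need span matching.
    if word.lower() in gazetteer:
        feats.append("in_gazetteer")
    return feats


CITIES = {"chennai", "mumbai", "lima"}  # toy gazetteer, illustration only
feats = word_features("Mumbai", CITIES)
```

Such a list could then be supplied alongside the word identity when building each example in `_run`, for instance under a separate namespace. Tokens containing characters that are special in VW's input format (such as `:` or `|`) would need to be escaped first. Shape features are only informative if the original casing of the data is kept.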

Named entity recognition with Vowpal Wabbit seems to memorize training data
