Question

I built a video captioning model.
It consists of a Seq2seq model that takes a video as input and outputs natural language.

I get very good training results, but the inference results are terrible:

Epoch 1 ; Batch loss: 5.181570 ; Batch accuracy: 60.28% ; Test accuracy: 00.89%
...
Epoch 128 ; Batch loss: 0.628466 ; Batch accuracy: 96.31% ; Test accuracy: 00.81%

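Explanation

The accuracy is low because of my accuracy function: it compares the generated result with the captions word by word.

Because of teacher forcing, this calculation is suitable for training but not for inference.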

Example

True descriptions:

a football match is going on <end>
the football player are made a goal <end>
the crowd cheers as soccer players work hard to gain control of the ball <end>

Generated description:

a group of young men play a game of soccer <end>

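My model correctly understands what is happening, but it does not express it exactly (word for word) the way the expected descriptions do.
For this specific example, the accuracy value would be only 1/31.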
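To make the problem concrete, here is a minimal sketch of a position-wise comparison applied to the example above. The actual accuracy function is not shown in the question, so `positional_matches` is a hypothetical stand-in that assumes token i of the prediction is compared with token i of each caption.

```python
# Hypothetical stand-in for the position-wise accuracy described above.
references = [
    "a football match is going on <end>",
    "the football player are made a goal <end>",
    "the crowd cheers as soccer players work hard to gain control of the ball <end>",
]
prediction = "a group of young men play a game of soccer <end>"


def positional_matches(predicted, reference):
    """Count positions where both sentences hold the same token."""
    # zip stops at the shorter sentence, so trailing tokens never match.
    return sum(p == r for p, r in zip(predicted.split(), reference.split()))


matches = [positional_matches(prediction, ref) for ref in references]
print(matches)  # [1, 0, 0]
```

Only the leading "a" lines up with the first caption, even though the prediction describes the same scene.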

How can I compute the inference accuracy sensibly?

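I thought about extracting the keywords of the sentences, then checking whether all the keywords in the predicted sentence can be found in the captions.
But then I would also have to check that the sentence is a correct English sentence...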
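The keyword idea could be sketched roughly as follows. The stop-word list is a small hypothetical one chosen for this example, and matching is exact with no stemming; a real version would likely need a fuller list and some normalization.

```python
# Assumed, illustrative stop-word list; not taken from any library.
STOP_WORDS = {"a", "the", "of", "is", "are", "as", "to", "on", "<end>"}

references = [
    "a football match is going on <end>",
    "the football player are made a goal <end>",
    "the crowd cheers as soccer players work hard to gain control of the ball <end>",
]
prediction = "a group of young men play a game of soccer <end>"


def keywords(sentence):
    """Return the set of tokens that are not stop words."""
    return {tok for tok in sentence.split() if tok not in STOP_WORDS}


def keyword_overlap(predicted, refs):
    """Fraction of predicted keywords found in at least one reference."""
    predicted_keys = keywords(predicted)
    reference_keys = set().union(*(keywords(ref) for ref in refs))
    if not predicted_keys:
        return 0.0
    return len(predicted_keys & reference_keys) / len(predicted_keys)


score = keyword_overlap(prediction, references)
print(round(score, 3))  # 0.167: only "soccer" out of 6 keywords is found
```

Exact matching misses near-matches such as "play"/"players" and "soccer"/"football", so this score stays low for a caption that is arguably correct.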

Maybe you can think of a simpler way to compute the accuracy. Let me know!

Answer 1

Use the BLEU score (Bilingual Evaluation Understudy) to compare hypotheses against references.

import nltk.translate.bleu_score

def bleu_score(hypotheses, references):
    return nltk.translate.bleu_score.corpus_bleu(references, hypotheses)  # references come first

Example:

# one hypothesis with two tokenized references
from nltk.translate.bleu_score import corpus_bleu
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
hypotheses = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, hypotheses)
print(score)

Output:

1.0
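To show what BLEU rewards, the sketch below computes clipped (modified) unigram precision, one ingredient of BLEU, for the captions from the question. It is a simplified plain-Python illustration, not NLTK's implementation: full BLEU also combines higher-order n-gram precisions and applies a brevity penalty. The function name `clipped_unigram_precision` is made up for this example.

```python
from collections import Counter

references = [
    "a football match is going on <end>".split(),
    "the football player are made a goal <end>".split(),
    "the crowd cheers as soccer players work hard to gain control of the ball <end>".split(),
]
hypothesis = "a group of young men play a game of soccer <end>".split()


def clipped_unigram_precision(hyp, refs):
    """Share of hypothesis words that are covered by the references."""
    if not hyp:
        return 0.0
    # For each word, the most times it occurs in any single reference.
    max_ref_counts = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    # A hypothesis word is credited at most as often as a reference uses it.
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in Counter(hyp).items())
    return clipped / len(hyp)


# "<end>" is kept as an ordinary token here, as in the question's captions.
precision = clipped_unigram_precision(hypothesis, references)
print(round(precision, 3))  # 0.364, i.e. 4 of 11 tokens are credited
```

Unlike a position-wise comparison, words are credited wherever they appear, and several references can each contribute matches.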

Other metrics are:

METEOR
ROUGE_L
CIDEr

See: https://github.com/arjun-kava/Video2Description/blob/VideoCaption/cocoeval.py
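As a rough illustration of one of the metrics listed above, ROUGE-L is based on the longest common subsequence (LCS) between a hypothesis and a reference. The sketch below is a simplified single-reference version using a plain F1; it is not the code from the linked repository, and reference implementations may weight recall differently and handle multiple references.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l_f1(hyp, ref):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


hypothesis = "a group of young men play a game of soccer <end>".split()
reference = "a football match is going on <end>".split()
score = rouge_l_f1(hypothesis, reference)
print(lcs_length(hypothesis, reference), round(score, 3))  # 2 0.222
```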
