Low alpha for NLTK agreement using MASI distance

Time: 2017-08-17 17:50:45

Tags: annotations nltk distance alpha

When I compute inter-annotator agreement in NLTK with MASI as the distance function, I get a very low value for Krippendorff's alpha.

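Three coders (Inky, Blinky, and Sue) were instructed to assign topic labels (love, gifts, slime, or gaming) to two texts (text01 and text02) based on each text's content. A text can cover more than one topic, so a coder may assign several labels to the same text. The data and the code used to compute the agreement statistics are shown below: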

import nltk
from nltk.metrics import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance

# Each record is a (coder, item, label) triple; labels are frozensets
# because a coder may assign several topics to one text.
data = [
    ('inky', 'text01', frozenset(['love', 'gifts'])),
    ('blinky', 'text01', frozenset(['love', 'gifts'])),
    ('sue', 'text01', frozenset(['love', 'gifts'])),
    ('inky', 'text02', frozenset(['slime', 'gaming'])),
    ('blinky', 'text02', frozenset(['slime'])),
    ('sue', 'text02', frozenset(['slime', 'gaming'])),
]

jaccard_task = nltk.AnnotationTask(distance=jaccard_distance)
masi_task = nltk.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]
for task in tasks:
    task.load_array(data)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
    print()

When I run the code, I get the following results:

Statistics for dataset using <function jaccard_distance at 0x09D26DB0>
C: {'inky', 'sue', 'blinky'}
I: {'text01', 'text02'}
K: {frozenset({'slime'}), frozenset({'love', 'gifts'}), frozenset({'gaming', 'slime'})}
Pi: 0.7272727272727273
Kappa: 0.7777777777777777
Multi-Kappa: 0.7499999999999999
Alpha: 0.75

Statistics for dataset using <function masi_distance at 0x09D26DF8>
C: {'inky', 'sue', 'blinky'}
I: {'text01', 'text02'}
K: {frozenset({'slime'}), frozenset({'love', 'gifts'}), frozenset({'gaming', 'slime'})}
Pi: 0.8172727272727272
Kappa: 0.8511111111111113
Multi-Kappa: 0.8324999999999998
Alpha: -1.5

My question is: why is the alpha value so much lower with the MASI distance function than with Jaccard?

1 answer:

Answer 0 (score: 0):

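When I ran the code provided, I could not reproduce the problem: I got a correct value for Krippendorff's alpha with the MASI distance. I used Python 3.5.2, NumPy 1.18.2, and NLTK 3.4.5. So the most likely explanation is that your NLTK installation needs to be updated.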
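As a sanity check that does not depend on the installed NLTK version, alpha can be recomputed by hand. The following is a minimal sketch, not NLTK's own implementation: it assumes the usual definition of Krippendorff's alpha as 1 - D_o / D_e, non-empty label sets, and MASI monotonicity weights of 1, 0.67, 0.33, and 0. The helper names `jaccard`, `masi`, and `krippendorff_alpha` are hypothetical and exist only for this illustration.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard distance: 1 - |a & b| / |a | b| (assumes non-empty sets)."""
    return 1 - len(a & b) / len(a | b)


def masi(a, b):
    """MASI distance: Jaccard similarity scaled by a monotonicity weight.

    The weights 1 / 0.67 / 0.33 / 0 are an assumption of this sketch.
    """
    inter = len(a & b)
    if a == b:
        m = 1.0   # identical sets
    elif inter == min(len(a), len(b)):
        m = 0.67  # one set is contained in the other
    elif inter > 0:
        m = 0.33  # partial overlap, neither contains the other
    else:
        m = 0.0   # disjoint sets
    return 1 - inter / len(a | b) * m


def krippendorff_alpha(data, distance):
    """alpha = 1 - D_o / D_e for a list of (coder, item, label) triples."""
    by_item = {}
    for _coder, item, label in data:
        by_item.setdefault(item, []).append(label)
    labels = [label for _coder, _item, label in data]
    n = len(labels)
    # Observed disagreement: ordered pairs within each item,
    # each item weighted by 1 / (number of its labels - 1).
    d_o = sum(
        sum(distance(a, b) for a, b in permutations(vals, 2)) / (len(vals) - 1)
        for vals in by_item.values()
        if len(vals) > 1
    ) / n
    # Expected disagreement: ordered pairs over all labels, ignoring items.
    d_e = sum(distance(a, b) for a, b in permutations(labels, 2)) / (n * (n - 1))
    return 1 - d_o / d_e


data = [
    ('inky', 'text01', frozenset(['love', 'gifts'])),
    ('blinky', 'text01', frozenset(['love', 'gifts'])),
    ('sue', 'text01', frozenset(['love', 'gifts'])),
    ('inky', 'text02', frozenset(['slime', 'gaming'])),
    ('blinky', 'text02', frozenset(['slime'])),
    ('sue', 'text02', frozenset(['slime', 'gaming'])),
]

alpha_jaccard = krippendorff_alpha(data, jaccard)
alpha_masi = krippendorff_alpha(data, masi)
print("Alpha (Jaccard): {:.4f}".format(alpha_jaccard))
print("Alpha (MASI):    {:.4f}".format(alpha_masi))
```

Under these assumptions the sketch gives 0.75 for Jaccard, matching the output in the question, and roughly 0.68 for MASI. The only disagreement within an item is between {'slime', 'gaming'} and {'slime'}; calling that distance δ, alpha for this dataset reduces to 1 - 10δ / (18 + 4δ), which stays between about 0.55 and 1 for any δ in [0, 1]. A value of -1.5 therefore cannot come from the standard formula on this data, which is consistent with the problem lying in the installed library version.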