Error when comparing files: IndexError: list index out of range

Date: 2017-01-07 13:21:12

Tags: python offset

I have a program that compares files line by line to compute precision, reading from two folders (a "gold" folder and a "prediction" folder).

An extracted (prediction) file looks like this:

T1  Task 5 19   nonlinear wave
T2  Task 5 29   nonlinear wave equations
T3  Task 15 29  wave equations
T4  Task 86 111 general analytical method
T5  Task 94 111 analytical method
T6  Task 199 213    minimum stages
T7  Task 268 287    efficient technique
T8  Task 268 298    efficient technique relatingto

And the gold file:

T1  Process 5 14    oxidation
T2  Material 69 84  Ti-based alloys
T3  Material 186 192    alloys
T4  Task 264 349    understand the role that composition has on the oxidation behavior of Ti-based alloys
T5  Process 312 321 oxidation
T6  Material 334 349    Ti-based alloys
T7  Material 400 415    Ti-based alloys
T8  Material 445 451    alloys
T9  Process 480 489 oxidation

The problem is that this code produces the following error:

Traceback (most recent call last):
  File "C:\Users\chedi\Downloads\Semeval\eval.py", line 214, in <module>
    calculateMeasures(folder_gold, folder_pred, remove_anno)
  File "C:\Users\chedi\Downloads\Semeval\eval.py", line 31, in calculateMeasures
    res_full_pred, res_pred, spans_pred, rels_pred = normaliseAnnotations(f_pred, remove_anno)
  File "C:\Users\chedi\Downloads\Semeval\eval.py", line 130, in normaliseAnnotations
    r_g_offs = r_g[1].split(" ")
IndexError: list index out of range

The error points to line 130 and to the format of the extracted files, but both files look identically formatted: the first and second columns are separated by a tab, and the offsets within the second column by spaces.
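Since a tab and a run of spaces look the same in most editors, one way to check which separator a line actually uses (a debugging sketch with made-up sample lines, not part of the script) is to print each line's repr():

```python
# Debugging sketch with made-up sample lines: repr() shows tabs as "\t",
# so tab-separated and space-separated columns can be told apart.
lines = [
    "T1\tTask 5 19\tnonlinear wave\n",   # tab after the ID (well-formed)
    "T2 Task 5 29\tnonlinear wave\n",    # space after the ID (malformed)
]
for l in lines:
    cols = l.strip("\n").split("\t")
    print(repr(l), "->", len(cols), "tab-separated columns")
```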

#!/usr/bin/python
# by Mattew Peters, who spotted that sklearn does macro averaging not micro averaging correctly and changed it

import os
from sklearn.metrics import precision_recall_fscore_support
import sys

def calculateMeasures(folder_gold="data/dev/", folder_pred="data_pred/dev/", remove_anno = ""):
    '''
    Calculate P, R, F1, Macro F
    :param folder_gold: folder containing gold standard .ann files
    :param folder_pred: folder containing prediction .ann files
    :param remove_anno: if set to "rel", relations will be ignored. Use this setting to only evaluate
    keyphrase boundary recognition and keyphrase classification. If set to "types", only keyphrase boundary recognition is evaluated.
    Note that for the latter, false positive
    :return:
    '''

    flist_gold = os.listdir(folder_gold)
    res_all_gold = []
    res_all_pred = []
    targets = []

    for f in flist_gold:
        # ignoring non-.ann files, should there be any
        if not str(f).endswith(".ann"):
            continue
        f_gold = open(os.path.join(folder_gold, f), "r")
        try:
            f_pred = open(os.path.join(folder_pred, f), "r")
            res_full_pred, res_pred, spans_pred, rels_pred = normaliseAnnotations(f_pred, remove_anno)
        except IOError:
            print(f + " file missing in " + folder_pred + ". Assuming no predictions are available for this file.")
            res_full_pred, res_pred, spans_pred, rels_pred = [], [], [], []

        res_full_gold, res_gold, spans_gold, rels_gold = normaliseAnnotations(f_gold, remove_anno)

        spans_all = set(spans_gold + spans_pred)

        for i, r in enumerate(spans_all):
            if r in spans_gold:
                target = res_gold[spans_gold.index(r)].split(" ")[0]
                res_all_gold.append(target)
                if not target in targets:
                    targets.append(target)
            else:
                # those are the false positives, contained in pred but not gold
                res_all_gold.append("NONE")

            if r in spans_pred:
                target_pred = res_pred[spans_pred.index(r)].split(" ")[0]
                res_all_pred.append(target_pred)
            else:
                # those are the false negatives, contained in gold but not pred
                res_all_pred.append("NONE")


    #y_true, y_pred, labels, targets
    prec, recall, f1, support = precision_recall_fscore_support(
        res_all_gold, res_all_pred, labels=targets, average=None)
    # unpack the precision, recall, f1 and support
    metrics = {}
    for k, target in enumerate(targets):
        metrics[target] = {
            'precision': prec[k],
            'recall': recall[k],
            'f1-score': f1[k],
            'support': support[k]
        }

    # now micro-averaged
    if remove_anno != 'types':
        prec, recall, f1, s = precision_recall_fscore_support(
            res_all_gold, res_all_pred, labels=targets, average='micro')
        metrics['overall'] = {
            'precision': prec,
            'recall': recall,
            'f1-score': f1,
            'support': sum(support)
        }
    else:
        # just binary classification, nothing to average
        metrics['overall'] = metrics['KEYPHRASE-NOTYPES']

    print_report(metrics, targets)
    return metrics


def print_report(metrics, targets, digits=2):
    def _get_line(results, target, columns):
        line = [target]
        for column in columns[:-1]:
            line.append("{0:0.{1}f}".format(results[column], digits))
        line.append("%s" % results[columns[-1]])
        return line

    columns = ['precision', 'recall', 'f1-score', 'support']

    fmt = '%11s' + '%9s' * 4 + '\n'
    report = [fmt % tuple([''] + columns)]
    report.append('\n')
    for target in targets:
        results = metrics[target]
        line = _get_line(results, target, columns)
        report.append(fmt % tuple(line))
    report.append('\n')

    # overall
    line = _get_line(metrics['overall'], 'avg / total', columns)
    report.append(fmt % tuple(line))
    report.append('\n')

    print(''.join(report))


def normaliseAnnotations(file_anno, remove_anno):
    '''
    Parse annotations from the annotation files: remove relations (if requested), convert rel IDs to entity spans
    :param file_anno:
    :param remove_anno:
    :return:
    '''
    res_full_anno = []
    res_anno = []
    spans_anno = []
    rels_anno = []

    for l in file_anno:
        r_g = l.strip().split("\t")
        r_g_offs = r_g[1].split(" ")

        # remove relation instances if specified
        if remove_anno != "" and r_g_offs[0].endswith("-of"):
            continue

        res_full_anno.append(l.strip())
        # normalise relation instances by looking up entity spans for relation IDs
        if r_g_offs[0].endswith("-of"):
            arg1 = r_g_offs[1].replace("Arg1:", "")
            arg2 = r_g_offs[2].replace("Arg2:", "")
            for l in res_full_anno:
                r_g_tmp = l.strip().split("\t")
                if r_g_tmp[0] == arg1:
                    ent1 = r_g_tmp[1].replace(" ", "_")
                if r_g_tmp[0] == arg2:
                    ent2 = r_g_tmp[1].replace(" ", "_")

            spans_anno.append(" ".join([ent1, ent2]))
            res_anno.append(" ".join([r_g_offs[0], ent1, ent2]))
            rels_anno.append(" ".join([r_g_offs[0], ent1, ent2]))

        else:
            spans_anno.append(" ".join([r_g_offs[1], r_g_offs[2]]))
            keytype = r_g[1]
            if remove_anno == "types":
                keytype = "KEYPHRASE-NOTYPES"
            res_anno.append(keytype)



    for r in rels_anno:
        r_offs = r.split(" ")
        # reorder hyponyms to start with smallest index
        if r_offs[0] == "Synonym-of" and r_offs[2].split("_")[1] < r_offs[1].split("_")[1]:  # 1, 2
            r = " ".join([r_offs[0], r_offs[2], r_offs[1]])

        # Check, in all other hyponym relations, if the synonymous entity with smallest index is used for them.
        # If not, change it so it is.
        if r_offs[0] == "Synonym-of":
            for r2 in rels_anno:
                r2_offs = r2.split(" ")
                if r2_offs[0] == "Hyponym-of" and r_offs[1] == r2_offs[1]:
                    r_new = " ".join([r2_offs[0], r_offs[2], r2_offs[2]])
                    rels_anno[rels_anno.index(r2)] = r_new

                if r2_offs[0] == "Hyponym-of" and r_offs[1] == r2_offs[2]:
                    r_new = " ".join([r2_offs[0], r2_offs[1], r_offs[2]])
                    rels_anno[rels_anno.index(r2)] = r_new

    rels_anno = list(set(rels_anno))

    res_full_anno_new = []
    res_anno_new = []
    spans_anno_new = []

    for r in res_full_anno:
        r_g = r.strip().split("\t")
        if r_g[0].startswith("R") or r_g[0] == "*":
            continue
        ind = res_full_anno.index(r)
        res_full_anno_new.append(r)
        res_anno_new.append(res_anno[ind])
        spans_anno_new.append(spans_anno[ind])

    for r in rels_anno:
        res_full_anno_new.append("R\t" + r)
        res_anno_new.append(r)
        spans_anno_new.append(" ".join([r.split(" ")[1], r.split(" ")[2]]))

    return res_full_anno_new, res_anno_new, spans_anno_new, rels_anno


if __name__ == '__main__':
    folder_gold = "data/dev/"
    folder_pred = "data_pred/dev/"
    remove_anno = ""  # "", "rel" or "types"
    if len(sys.argv) >= 2:
        folder_gold = sys.argv[1]
    if len(sys.argv) >= 3:
        folder_pred = sys.argv[2]
    if len(sys.argv) == 4:
        remove_anno = sys.argv[3]

    calculateMeasures(folder_gold, folder_pred, remove_anno)

1 Answer:

Answer 0 (score: 1)

Without your own files at hand, I tried the script with the "gold" file you provided, namely:

T1      Process 5 14    oxidation
T2      Material 69 84  Ti-based alloys
T3      Material 186 192    alloys
T4      Task 264 349    understand the role that composition has on the oxidation behavior of Ti-based alloys
T5      Process 312 321 oxidation
T6      Material 334 349    Ti-based alloys
T7      Material 400 415    Ti-based alloys
T8      Material 445 451    alloys
T9      Process 480 489 oxidation

For the program to run correctly without raising the 'list index out of range' error at the code line you mention, it is essential that there is a tab between the first column (the 'T's) and the second, and a single space between the other columns. A file not formatted this way (for example, with spaces instead of a tab between the first two columns) will produce that error. What actually happens in the line

r_g = l.strip('\n').split("\t")  

is this: the newline at the end of the line is removed first, and then the line is split on tabs. The line is thus split into the elements that make up the list r_g. In that case r_g_offs can be computed correctly, and it contains the elements of all columns except the first; it is used later in, for example, spans_anno.append(" ".join([r_g_offs[1], r_g_offs[2]])), to mention just one place.
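For a well-formed line, the two splits behave like this (a minimal sketch using one line from the gold file shown above):

```python
# Well-formed .ann line: a tab after the ID, single spaces inside the
# second column.
l = "T1\tProcess 5 14\toxidation\n"
r_g = l.strip('\n').split("\t")
print(r_g)        # ['T1', 'Process 5 14', 'oxidation']
r_g_offs = r_g[1].split(" ")
print(r_g_offs)   # ['Process', '5', '14']
```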

Let's look at a case that does not work and try to understand why. Suppose the .ann (gold) file is formatted not like this:

T1\tProcess    (a tab in between)

but like this:

T1 Process    (a space in between)

Then the code

r_g = l.strip('\n').split("\t")  

will produce a list with only one element instead of two, e.g.

r_g = ['T1 Process ...']

In this case r_g has only one element, r_g[0], so when one tries to access r_g[1] with

r_g_offs = r_g[1].split()  

one gets an

IndexError: list index out of range
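The failing case can be reproduced in isolation (a sketch with a deliberately malformed, all-spaces line):

```python
# Malformed line: spaces everywhere, no tab after the ID.
l = "T1 Process 5 14 oxidation\n"
r_g = l.strip('\n').split("\t")
print(r_g)                         # a single-element list
try:
    r_g_offs = r_g[1].split(" ")   # r_g[1] does not exist
except IndexError as e:
    print("IndexError:", e)        # list index out of range
```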

There is another situation in which you can get the same error: if there is an empty line at the end of the file, then r_g = [''], which means r_g is again a list with a single element. As in the previous case, when the script executes the line r_g_offs = r_g[1].split(), it tries to access r_g[1], which does not exist, because in this case the only element in the list is r_g[0], and you get the 'list index out of range' error.

From the two cases shown above, we can conclude that the script is very sensitive to how the files are formatted and written (tabs, spaces, and no empty line at the end), so be careful when generating the files that are fed to the main script.
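If you prefer to make the parsing itself more tolerant, one option (a sketch with a hypothetical helper name, not part of the original eval.py) is to skip blank or tab-less lines before indexing:

```python
def parse_ann_line(l):
    """Return (ID, second-column fields), or None for lines that would crash."""
    l = l.strip("\n")
    if not l:
        return None                 # blank line, e.g. at the end of the file
    r_g = l.split("\t")
    if len(r_g) < 2:
        return None                 # no tab between the first two columns
    return r_g[0], r_g[1].split(" ")

print(parse_ann_line("T1\tProcess 5 14\toxidation"))  # ('T1', ['Process', '5', '14'])
print(parse_ann_line(""))                             # None
print(parse_ann_line("T1 Process 5 14 oxidation"))    # None
```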