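Creating word pairs, triples, etc. for evaluation in BLEU

Time: 2014-04-21 16:58:15

Tags: java string

I need to create lists of word pairs, triples, etc. for evaluation with the BLEU metric. BLEU starts with unigrams (single words) and goes up to N-grams, with N specified at runtime.

For example, given the sentence "Israeli officials are responsible for airport security"

For unigrams, it is just a list of the words. For bigrams, it would be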

Israeli officials
officials are
are responsible
responsible for
for airport
airport security

The corresponding trigrams are

Israeli officials are
officials are responsible
are responsible for
responsible for airport
for airport security

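I wrote a working BLEU that hard-codes the n-gram order to 4 and brute-forces the unigrams and so on. It is ugly and, besides, I need to be able to supply N at runtime.

A snippet that attempts to generate the pairs, triples, etc.: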

    String current = "";
    int temp = 0;
    for (int i = 0; i < goldWords.length - N_GRAM_ORDER; i++) {
        current = current + ":" + goldWords[i];
        while (temp < N_GRAM_ORDER) {
            current = current + ":" + goldWords[temp + i];
            temp++;
        }
        goldNGrams.add(current);
        current = "";
        temp = 0;
    }
}

Edit: the output of this snippet, for bigrams, should be:

israeli:officials
officials:are
are:responsible
responsible:for
for:airport
airport:security

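Here goldWords is a String array containing the individual words to build the n-grams from. I have been tinkering with this loop for days, drawing out the relationships and so on, and it just won't click for me. Can anyone see what I'm doing wrong?

3 Answers:

Answer 0 (score: 1)

I would change this: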

String current = "";
int temp = 0;
for (int i = 0; i < goldWords.length - N_GRAM_ORDER; i++) {
    current = current + ":" + goldWords[i];
    while (temp < N_GRAM_ORDER) {
        current = current + ":" + goldWords[temp + i];
        temp++;
    }
    goldNGrams.add(current);
    current = "";
    temp = 0;
}
}

to this:

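 String current = "";
 for (int i = 0; i < goldWords.length; i++) {
     for (int j = 0; j < N_GRAM_ORDER; j++) {
         // Guard against reading past the end of the array.
         if (i + j < goldWords.length) {
             // Put ":" between words only, not before the first one.
             current += (j == 0 ? "" : ":") + goldWords[i + j];
         }
     }
     goldNGrams.add(current);
     current = "";
 }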

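The outer for loop iterates over the first word of each n-gram, and the inner loop iterates over all the words to include in it. One thing to note is the if statement, which prevents an array-index-out-of-bounds error. If you only want complete n-grams, the check should be moved outside the inner for loop.

With the if statement inside the inner loop, you get: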

Israeli:officials
officials:are
are:responsible
responsible:for
for:airport
airport:security
security

If you want:

Israeli:officials
officials:are
are:responsible
responsible:for
for:airport
airport:security

instead, try the following code:

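 String current = "";
 for (int i = 0; i < goldWords.length; i++) {
     // Build an n-gram only when all N_GRAM_ORDER words fit in the array.
     if (i + N_GRAM_ORDER <= goldWords.length) {
         for (int j = 0; j < N_GRAM_ORDER; j++) {
             current += (j == 0 ? "" : ":") + goldWords[i + j];
         }
         // Add inside the check so that no empty strings are collected.
         goldNGrams.add(current);
         current = "";
     }
 }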

(The code above was written without checking it against a compiler, so there may be off-by-one or minor syntax errors. Verify it, but it should get you close.)
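For reference, the complete-n-gram variant can also be sketched as a small self-contained program. This is only one possible way to write it, assuming Java 8 or later (for String.join) and an order of at least 1; the class name NGramSketch and the method name joinedNGrams are made up for this illustration and do not come from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramSketch {
    // Hypothetical helper: returns every complete n-gram of `words`,
    // with the words of each n-gram joined by ':'.
    static List<String> joinedNGrams(String[] words, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> result = new ArrayList<>();
        // The last complete n-gram starts at index words.length - n.
        for (int i = 0; i + n <= words.length; i++) {
            // copyOfRange takes an exclusive end index.
            result.add(String.join(":", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] goldWords =
            "Israeli officials are responsible for airport security".split(" ");
        // Prints one bigram per line.
        for (String gram : joinedNGrams(goldWords, 2)) {
            System.out.println(gram);
        }
    }
}
```

Writing the loop condition as i + n <= words.length keeps the bounds check in one place, so the inner code never needs its own guard, and a sentence shorter than n simply yields an empty list.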

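Answer 1 (score: 1)

Here is an alternative that collects each n-gram as a String[] rather than as a single string. I changed the number of iterations of the outer for loop to make sure it captures the last n-gram.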

public static List<String[]> ngrams(String[] gold, int n_length) {
    List<String[]> list = new ArrayList<String[]>();
    for (int i = 0; i < gold.length - (n_length-1); i++) {
        String[] ngram = new String[n_length];
        for(int j = 0; j < n_length; j++) {
            ngram[j] = gold[i+j];
        }
        list.add(ngram);
    }
    return list;
}
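Since BLEU needs every order from unigrams up to N, the same loop can be wrapped in one more loop over the order. The following is a rough sketch under some assumptions: that per-n-gram occurrence counts are what the rest of the metric will consume, that Java 8 or later is available (for Map.merge), and that joining words with ':' is an acceptable key format. The names NGramCounts and countsUpTo and the sample sentence are invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NGramCounts {
    // Hypothetical helper: element k-1 of the returned list maps each
    // k-gram (words joined by ':') to how often it occurs in `words`.
    static List<Map<String, Integer>> countsUpTo(String[] words, int maxOrder) {
        List<Map<String, Integer>> all = new ArrayList<>();
        for (int n = 1; n <= maxOrder; n++) {
            Map<String, Integer> counts = new HashMap<>();
            // Only start positions that leave room for n words.
            for (int i = 0; i + n <= words.length; i++) {
                StringBuilder gram = new StringBuilder(words[i]);
                for (int j = 1; j < n; j++) {
                    gram.append(':').append(words[i + j]);
                }
                // Insert 1 for a new n-gram, otherwise add 1 to its count.
                counts.merge(gram.toString(), 1, Integer::sum);
            }
            all.add(counts);
        }
        return all;
    }

    public static void main(String[] args) {
        // Made-up sentence with a repeated word, to show counts above 1.
        String[] goldWords = "the cat sat on the mat".split(" ");
        List<Map<String, Integer>> counts = countsUpTo(goldWords, 3);
        System.out.println("unigrams: " + counts.get(0));
        System.out.println("bigrams:  " + counts.get(1));
        System.out.println("trigrams: " + counts.get(2));
    }
}
```

Because maxOrder is an ordinary parameter, the order can be supplied at runtime instead of being hard-coded to 4.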

Answer 2 (score: 1)

Code that produces its output according to N_GRAM_ORDER:

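  int N_GRAM_ORDER = 3, temp = 0, i;
  String current = "";
  // Step by N_GRAM_ORDER, so the groups do not overlap.
  for (i = 0; i <= goldWords.length - N_GRAM_ORDER; i += N_GRAM_ORDER) {
      while (temp < N_GRAM_ORDER) {
          // Put ":" between words only, not before the first one.
          current = current + (temp == 0 ? "" : ":") + goldWords[temp + i];
          temp++;
      }
      goldGrams.add(current);
      current = "";
      temp = 0;
  }

  // Collect any leftover words into a final, shorter group.
  if (i < goldWords.length) {
      temp = i;
      while (temp < goldWords.length) {
          current = current + (temp == i ? "" : ":") + goldWords[temp++];
      }
      goldGrams.add(current);
  }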

Output

Israeli:officials:are
responsible:for:airport
security