Question

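I want to use NDCG (Normalized Discounted Cumulative Gain) as a metric for evaluating ranking predictions. I found a GitHub notebook (https://github.com/sophwats/XGBoost-lambdaMART/blob/master/LambdaMART%20from%20XGBoost.ipynb) that defines NDCG as follows: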

import math

def ndcg_p(ordered_data, p):
    """Normalised discounted cumulative gain at cutoff p.

    Returns 0 if all of the ordered data is undesirable (all labels are 0).
    """
    p = min(p, len(ordered_data))  # guard against a cutoff longer than the list
    if p <= 0 or sum(ordered_data) == 0:
        return 0
    # DCG of the labels in the given (predicted) order
    DCG_p = 0
    for index in range(p):
        DCG_p += (2 ** ordered_data[index] - 1) / math.log(index + 2, 2)
    # ideal DCG: the same sum over the labels sorted best-first;
    # sorted() leaves the caller's list untouched, and the sum stops at p
    # so that a perfect ranking scores exactly 1
    ideal = sorted(ordered_data, reverse=True)
    iDCG_p = 0
    for index in range(p):
        iDCG_p += (2 ** ideal[index] - 1) / math.log(index + 2, 2)
    return DCG_p / iDCG_p
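
For reference, the quantity computed by the function above can be written as follows, where rel_i is the label at 1-based position i of the list ordered by predicted score (the code's 0-based index + 2 becomes i + 1 here):

```latex
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}
```

IDCG is the same sum taken over the labels sorted in descending order. In the per-query loop that follows, p equals the number of documents in the query, so both sums run over the full list.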

with open('../LearningToRank/MSLR-WEB10K/Fold1/test_dat.txt.group', 'r') as f:
    groups = [int(line) for line in f]

testing_labels = testing_data.get_label()


## compute NDCG for each query
lower = 0
upper = 0
ndcgs = []
for many in groups:
    upper = upper + many
    predicted = preds[lower:upper]
    labelled = testing_labels[lower:upper]
    # order the true labels by descending predicted score
    ordered = [x for _, x in sorted(zip(predicted, labelled), reverse=True)]
    result = ndcg_p(ordered, many)
    ndcgs.append(result)
    lower = upper

I am not able to adapt this to my case. The file f is built using a bash script, as follows:

## remove qid from the files.
sed 's/[[:space:]][a-z]*:[0-9]*//g' train.txt > train_dat.txt
sed 's/[[:space:]][a-z]*:[0-9]*//g' vali.txt > vali_dat.txt
sed 's/[[:space:]][a-z]*:[0-9]*//g' test.txt > test_dat.txt
echo "removed qids"
# extract the qids to their own files
grep -oh "qid:[0-9]*" train.txt > qids.txt
grep -oh "qid:[0-9]*" test.txt > test_qids.txt
grep -oh "qid:[0-9]*" vali.txt > vali_qids.txt
echo "extracted qids"
## make freq table from qids
uniq -c qids.txt > train1_dat.txt.group
uniq -c test_qids.txt > test1_dat.txt.group
uniq -c vali_qids.txt > vali1_dat.txt.group
echo "made frequency tables"
## extract first entry from frequency table
awk -F " " '{print $1}' train1_dat.txt.group > train_dat.txt.group
awk -F " " '{print $1}' test1_dat.txt.group > test_dat.txt.group
awk -F " " '{print $1}' vali1_dat.txt.group > vali_dat.txt.group

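I think the .txt.group file assigned to f is a file that represents each query ID by its frequency, i.e. the number of rows belonging to each query.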
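
To illustrate that reading of the script, here is a minimal Python sketch of what the uniq -c and awk steps produce: one count per run of consecutive identical query IDs, which can then be turned into row ranges. The function names and the sample qid values are made up for illustration; they are not part of the notebook.

```python
from itertools import groupby

def group_sizes(qids):
    """Count consecutive runs of identical query IDs, like `uniq -c`."""
    return [sum(1 for _ in run) for _, run in groupby(qids)]

def group_slices(sizes):
    """Turn group sizes into (start, stop) row ranges, one per query."""
    slices, start = [], 0
    for size in sizes:
        slices.append((start, start + size))
        start += size
    return slices

# Hypothetical qid column, already grouped by query as in the MSLR files
qids = ["qid:1", "qid:1", "qid:1", "qid:7", "qid:7", "qid:3"]
sizes = group_sizes(qids)
print(sizes)                # [3, 2, 1]
print(group_slices(sizes))  # [(0, 3), (3, 5), (5, 6)]
```

Like uniq -c, this only counts adjacent duplicates, so it assumes the rows of each query are stored contiguously.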

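I have a data frame df that contains: contest_id (the equivalent of the query ID), the true score true, the true rank, the prediction pred obtained with xgboost and 'objective': 'rank:pairwise', and the predicted rank rank that I generated by sorting the predictions.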

How can I compute NDCG?
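
One possible way to approach this, sketched under assumptions: group the rows by contest ID instead of relying on a .group file, order the true scores within each group by descending prediction, and apply the same 2**rel - 1 gain as the notebook's function. The sketch assumes that a higher true score means a more relevant item and that a higher predicted score means a better rank. The function names and the sample values are hypothetical.

```python
import math
from collections import defaultdict

def dcg(labels):
    """DCG with gain 2**rel - 1 and a log2(position + 1) discount."""
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(labels))

def ndcg_per_group(group_ids, true_scores, pred_scores, k=None):
    """Return {group_id: NDCG@k}, ranking each group by predicted score."""
    groups = defaultdict(list)
    for gid, rel, score in zip(group_ids, true_scores, pred_scores):
        groups[gid].append((score, rel))
    result = {}
    for gid, rows in groups.items():
        # True labels in the order induced by the predictions (best first)
        ranked = [rel for _, rel in sorted(rows, key=lambda r: r[0], reverse=True)]
        ideal = sorted(ranked, reverse=True)
        cut = len(ranked) if k is None else min(k, len(ranked))
        ideal_dcg = dcg(ideal[:cut])
        # A group with no relevant item gets 0, as in the notebook's function
        result[gid] = dcg(ranked[:cut]) / ideal_dcg if ideal_dcg > 0 else 0.0
    return result

# Hypothetical values standing in for three columns of the data frame
contest_id = [1, 1, 1, 2, 2]
true = [3, 1, 2, 0, 1]
pred = [0.9, 0.2, 0.5, 0.8, 0.1]

scores = ndcg_per_group(contest_id, true, pred)
print(scores)                              # per-contest NDCG
print(sum(scores.values()) / len(scores))  # mean NDCG over contests
```

With a pandas data frame, the three columns could be passed in directly, for example ndcg_per_group(df['contest_id'], df['true'], df['pred']), assuming those are the actual column names. If the true score is one where lower is better, it would need to be converted to a relevance grade first.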

How to compute Normalized Discounted Cumulative Gain using ranking predictions - Python?

0 Answers: