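How to compute Normalized Discounted Cumulative Gain using ranking predictions - Python?

Time: 2019-11-19 23:22:01

Tags: python bash metrics ranking

I want to use NDCG (Normalized Discounted Cumulative Gain) as a metric to evaluate ranking predictions. I found a GitHub notebook (https://github.com/sophwats/XGBoost-lambdaMART/blob/master/LambdaMART%20from%20XGBoost.ipynb) that defines NDCG as follows: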

""" this returns 0 if all of the ordered data is undesirable"""
def ndcg_p(ordered_data, p):
    """normalised discounted cumulative gain"""
    if sum(ordered_data)==0:
        return 0
    else:
        indexloop = range(0, p)
        DCG_p = 0
        for index in indexloop:
            current_ratio=(2**(ordered_data[index])-1)*(math.log((float(index)+2), 2)**(-1))
            DCG_p = DCG_p + current_ratio
        ordered_data.sort(reverse=True)  
        K = len(ordered_data)
        indexloop = range(0, K)
        iDCG_p = 0
        for index in indexloop:
            current_ratio=(2**(ordered_data[index])-1)*((math.log((index+2), 2))**(-1))
            iDCG_p = iDCG_p + current_ratio
        return(DCG_p/iDCG_p)
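
For reference, the function above appears to follow the common exponential-gain definition of NDCG. Written out (with rel_i denoting the label of the item at 1-based rank i in predicted order, which is my notation and not the notebook's), it would read roughly as below; the code's 0-based index corresponds to i - 1, hence its log2(index + 2). The two agree when p equals the number of items in the list.

```latex
\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}
```

Here IDCG_p is the DCG_p obtained when the same labels are sorted in descending order, so NDCG_p is 1 for a perfect ranking.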

f = open('../LearningToRank/MSLR-WEB10K/Fold1/test_dat.txt.group', 'r')
x = f.readlines()
groups =[]
for line in x:
    groups.append(int(line))
f.close()

testing_labels = testing_data.get_label()


## compute ndcg for each query.
lower = 0
upper = 0
ndcgs = []
for many in groups:
    # many is the number of rows belonging to the current query.
    upper = upper + many
    predicted = preds[lower:upper]
    labeled = testing_labels[lower:upper]
    # Sort the labels by predicted score only, highest first.
    ordered = [label for _, label in sorted(zip(predicted, labeled), key=lambda pair: pair[0], reverse=True)]
    result = ndcg_p(ordered, many)
    ndcgs.append(result)
    lower = upper

I can't work out how to adapt x to my case ... The file f is built with a bash script, as follows:

## remove qid from the files.
sed 's/[[:space:]][a-z]*:[0-9]*//g' train.txt > train_dat.txt
sed 's/[[:space:]][a-z]*:[0-9]*//g' vali.txt > vali_dat.txt
sed 's/[[:space:]][a-z]*:[0-9]*//g' test.txt > test_dat.txt
echo "removed qids"
# extract the qids to their own files
grep -oh "qid:[0-9]*" train.txt > qids.txt
grep -oh "qid:[0-9]*" test.txt > test_qids.txt
grep -oh "qid:[0-9]*" vali.txt > vali_qids.txt
echo "extracted qids"
## make freq table from qids
uniq -c qids.txt > train1_dat.txt.group
uniq -c test_qids.txt > test1_dat.txt.group
uniq -c vali_qids.txt > vali1_dat.txt.group
echo "made frequency tables"
## extract first entry from frequency table
awk -F " " '{print $1}' train1_dat.txt.group > train_dat.txt.group
awk -F " " '{print $1}' test1_dat.txt.group > test_dat.txt.group
awk -F " " '{print $1}' vali1_dat.txt.group > vali_dat.txt.group

I think the .txt.group file assigned to f lists, for each query ID, its frequency (the number of rows that belong to that query).
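
To illustrate what such a group file would contain, here is a minimal Python sketch of the same counting that uniq -c performs in the script: it counts runs of consecutive identical query IDs. The function name group_sizes and the sample IDs are hypothetical, made up for illustration only.

```python
from itertools import groupby


def group_sizes(qids):
    """Return the length of each run of consecutive identical query IDs.

    Like `uniq -c`, this only merges adjacent duplicates, so the rows are
    assumed to be already grouped by query ID.
    """
    return [sum(1 for _ in run) for _, run in groupby(qids)]


# Hypothetical query IDs, one per data row, in file order.
example_qids = ["qid:1", "qid:1", "qid:1", "qid:7", "qid:7", "qid:3"]
sizes = group_sizes(example_qids)
print(sizes)
```

Each number is the count of rows for one query, in the order the queries appear, which is what the evaluation loop uses to slice the flat prediction and label arrays.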

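I have a dataframe df containing: contest_id (equivalent to the query ID), the true score true, the true rank rank, the prediction pred obtained with xgboost and 'objective':'rank:pairwise', and a predicted rank that I generate by sorting the predictions.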

How can I compute NDCG?
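
A minimal sketch of one possible approach, not a definitive implementation: group the rows by contest ID, sort each group by predicted score, and apply the exponential-gain NDCG to the true scores in that order. It assumes that a higher true score means a more relevant item and that true scores are small non-negative grades (2 ** score grows very quickly otherwise). The function names and the sample rows are hypothetical.

```python
import math
from collections import defaultdict


def dcg(relevances):
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))


def ndcg(relevances_in_predicted_order):
    """NDCG over the full list; 0.0 when no item has positive gain."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    if ideal == 0:
        return 0.0
    return dcg(relevances_in_predicted_order) / ideal


def ndcg_per_group(rows):
    """rows: iterable of (group_id, true_score, predicted_score) tuples."""
    by_group = defaultdict(list)
    for group_id, true_score, pred_score in rows:
        by_group[group_id].append((pred_score, true_score))
    result = {}
    for group_id, pairs in by_group.items():
        # Rank by predicted score only, highest first.
        ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
        result[group_id] = ndcg([true_score for _, true_score in ranked])
    return result


# Hypothetical data: (contest_id, true score, predicted score).
rows = [
    ("a", 3, 0.9), ("a", 2, 0.5), ("a", 0, 0.1),  # predicted order is ideal
    ("b", 0, 0.8), ("b", 1, 0.2),                 # relevant item ranked last
]
scores = ndcg_per_group(rows)
mean_ndcg = sum(scores.values()) / len(scores)
print(scores, mean_ndcg)
```

With a pandas dataframe whose columns are named contest_id, true and pred (an assumption about the layout), the rows could be supplied as df[["contest_id", "true", "pred"]].itertuples(index=False, name=None).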

0 answers:

No answers