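I want to use NDCG (Normalized Discounted Cumulative Gain) as a metric for evaluating ranking predictions. I found a GitHub notebook (https://github.com/sophwats/XGBoost-lambdaMART/blob/master/LambdaMART%20from%20XGBoost.ipynb) that defines NDCG as follows: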
```python
import math


def ndcg_p(ordered_data, p):
    """Normalised discounted cumulative gain.

    Returns 0 if all of the ordered data is undesirable.
    """
    if sum(ordered_data) == 0:
        return 0
    # DCG over the first p items, taken in predicted order
    DCG_p = 0
    for index in range(0, p):
        current_ratio = (2 ** ordered_data[index] - 1) / math.log(index + 2, 2)
        DCG_p = DCG_p + current_ratio
    # ideal DCG: the same sum over the labels sorted best-first
    # (sorted copy, so the caller's list is not reordered in place)
    ideal_data = sorted(ordered_data, reverse=True)
    K = len(ideal_data)
    # note: this sums over all K items, so the ratio only equals
    # NDCG@p when p == len(ordered_data)
    iDCG_p = 0
    for index in range(0, K):
        current_ratio = (2 ** ideal_data[index] - 1) / math.log(index + 2, 2)
        iDCG_p = iDCG_p + current_ratio
    return DCG_p / iDCG_p
```
```python
# read the group file: one integer per line, the number of rows in each query
f = open('../LearningToRank/MSLR-WEB10K/Fold1/test_dat.txt.group', 'r')
x = f.readlines()
groups = []
for line in x:
    groups.append(int(line))
f.close()

# testing_data and preds come from earlier in the notebook (not shown here)
testing_labels = testing_data.get_label()

# compute NDCG for each query
lower = 0
upper = 0
ndcgs = []
for many in groups:
    upper = upper + many
    predicted = preds[lower:upper]
    labeled = testing_labels[lower:upper]
    # true labels reordered by descending predicted score
    ordered = [label for _, label in sorted(zip(predicted, labeled), reverse=True)]
    result = ndcg_p(ordered, many)
    ndcgs.append(result)
    lower = upper
```
I can't work out how to adapt the `x` ... `f` part to my case.

The file was built with a bash script, as follows:
```bash
## remove qid from the files.
sed 's/[[:space:]][a-z]*:[0-9]*//g' train.txt > train_dat.txt
sed 's/[[:space:]][a-z]*:[0-9]*//g' vali.txt > vali_dat.txt
sed 's/[[:space:]][a-z]*:[0-9]*//g' test.txt > test_dat.txt
echo "removed qids"

## extract the qids to their own files
grep -oh "qid:[0-9]*" train.txt > qids.txt
grep -oh "qid:[0-9]*" test.txt > test_qids.txt
grep -oh "qid:[0-9]*" vali.txt > vali_qids.txt
echo "extracted qids"

## make freq table from qids
uniq -c qids.txt > train1_dat.txt.group
uniq -c test_qids.txt > test1_dat.txt.group
uniq -c vali_qids.txt > vali1_dat.txt.group
echo "made frequency tables"

## extract first entry from frequency table
awk -F " " '{print $1}' train1_dat.txt.group > train_dat.txt.group
awk -F " " '{print $1}' test1_dat.txt.group > test_dat.txt.group
awk -F " " '{print $1}' vali1_dat.txt.group > vali_dat.txt.group
```
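I think the `.txt.group` file that is opened as `f` and read into `x` represents the query IDs by their frequency: one line per query ID, giving the number of rows that belong to that query.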
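To check my understanding of the group file, here is a small Python sketch of what I believe the last two steps of the script compute, and how the counts map onto slices of the flat prediction array. `group_sizes` and `group_slices` are hypothetical helper names of my own, and the qid values are made up for illustration.

```python
from itertools import groupby


def group_sizes(qids):
    """Count consecutive runs of equal query IDs.

    Mirrors `uniq -c` followed by keeping only the count column.
    """
    return [sum(1 for _ in run) for _, run in groupby(qids)]


def group_slices(sizes):
    """Turn group sizes into (lower, upper) index pairs into a flat array."""
    slices = []
    lower = 0
    for size in sizes:
        slices.append((lower, lower + size))
        lower += size
    return slices


# made-up example: three rows for qid 1, two for qid 7, one for qid 3
qids = ["qid:1", "qid:1", "qid:1", "qid:7", "qid:7", "qid:3"]
sizes = group_sizes(qids)
print(sizes)                # [3, 2, 1]
print(group_slices(sizes))  # [(0, 3), (3, 5), (5, 6)]
```

Like `uniq -c`, this only merges adjacent identical IDs, so it assumes the rows of each query are stored contiguously.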
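I have a dataframe `df` with the following columns: `contest_id` (the same thing as a query ID), the true score `true`, the true rank, the predictions `pred` obtained with xgboost using `'objective': 'rank:pairwise'`, and the predicted rank `rank`, which I generated by sorting the predictions.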
How can I compute NDCG?
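A rough sketch of the direction I am considering: group the rows by ID directly instead of going through a `.group` file, then apply the same gain and discount as `ndcg_p` within each group. This assumes that a larger `true` score means a more relevant item and that the scores are non-negative. The helper names `dcg`, `ndcg`, and `ndcg_per_group` are hypothetical, not taken from the notebook or from xgboost, and the sample values are made up.

```python
import math
from collections import defaultdict


def dcg(labels):
    """DCG with gain 2**label - 1 and discount 1 / log2(position + 1)."""
    return sum((2 ** label - 1) / math.log2(i + 2) for i, label in enumerate(labels))


def ndcg(true_scores, pred_scores):
    """NDCG of one group: true scores reordered by descending prediction."""
    order = sorted(range(len(pred_scores)), key=lambda i: pred_scores[i], reverse=True)
    ranked = [true_scores[i] for i in order]
    ideal = dcg(sorted(true_scores, reverse=True))
    # a group with no relevant items gets 0, as in ndcg_p
    return dcg(ranked) / ideal if ideal > 0 else 0.0


def ndcg_per_group(group_ids, true_scores, pred_scores):
    """Return {group_id: NDCG}; rows of a group need not be contiguous."""
    grouped = defaultdict(lambda: ([], []))
    for gid, t, p in zip(group_ids, true_scores, pred_scores):
        grouped[gid][0].append(t)
        grouped[gid][1].append(p)
    return {gid: ndcg(t, p) for gid, (t, p) in grouped.items()}


# made-up data: group 1 is perfectly ordered, group 2 ranks its
# only relevant item second
ids = [1, 1, 1, 2, 2]
true = [3, 2, 0, 1, 0]
pred = [0.9, 0.5, 0.1, 0.2, 0.8]
scores = ndcg_per_group(ids, true, pred)
print(scores)
print(sum(scores.values()) / len(scores))  # mean NDCG over groups
```

With my dataframe, the call would presumably be `ndcg_per_group(df['contest_id'], df['true'], df['pred'])`, since the function only iterates over the three columns in parallel. Ties in `pred` keep their original row order here, whereas the notebook's `sorted(zip(...))` breaks ties by the label, so results can differ slightly when predictions tie.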