I am trying to implement Algorithm 1 from the following paper: http://www.research.rutgers.edu/~lihong/pub/Li10Contextual.pdf
It is a typical exploration-exploitation algorithm. I used the formula payoff = mean + constant * standard deviation.
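For reference, this is how I read the per-arm score in Algorithm 1 of the paper, with the mean as the first term and alpha times the standard deviation as the second. Here A_a is the arm's d x d feature matrix and b_a is its response vector, following the paper's notation:

```latex
p_{t,a} = \hat{\theta}_a^{\top} x_{t,a} + \alpha \sqrt{x_{t,a}^{\top} A_a^{-1} x_{t,a}},
\qquad \hat{\theta}_a = A_a^{-1} b_a
```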
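First I ran the algorithm on a set of data, then fed one record from that data set back in as a new input to see whether it would predict the correct output. It gave the wrong output, so I gave a reward of 0, recalculated the mean and standard deviation for that arm, and continued the algorithm. But it returns the same arm every time, and the mean does not change either.
Can someone explain how the mean and variance in this algorithm change when negative feedback is given? Is there a reason why I keep getting the same values?
I am programming in Java. The code is below.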
public void LINUCB(double[] newFeature, Arm arm) {
    LOGGER.log(Level.INFO, "LINUCB");
    LOGGER.log(Level.INFO, "Arm number " + arm.getArmID());
    if (arm.isNew()) {
        arm.setFeatureMatrix(getIdentityMatrix(ConstantValues.FEATURE_DIMENSION));
        arm.setResponseVector(new double[ConstantValues.FEATURE_DIMENSION]);
    }
    double[][] invertedFeatureMatrix = invert(arm.getFeatureMatrix());
    /* The response vector is [D*M][M]. It is the product of the transpose of
       the design matrix and the user feedback provided for each of the M trials. */
    //TODO use gradient descent here.
    double[] theta = getSquareMatrixColumnVectorMultiplication(invertedFeatureMatrix, arm.getResponseVector());
    double meanPayOff = getRowVectorColumnVectorMultiplication(theta, newFeature);
    System.out.print(" meanPayOff " + meanPayOff);
    double standardDeviation = calculateUCB(newFeature, arm.getFeatureMatrix());
    System.out.print(" standardDeviation " + standardDeviation);
    double payOffForArm = meanPayOff + standardDeviation;
    System.out.print(" payOffForArm " + payOffForArm);
    if (payOffForArm > maxPayOff) {
        maxPayOff = payOffForArm;
        //armWithMaxPayOff = arm;
        //indexOfArmWithMaxPayOff = armArrayList.indexOf(arm);
        maxPayOffArmID = arm.getArmID();
    }
    System.out.println(" ");
}

private double calculateUCB(double[] newFeature, double[][] featureMatrix) {
    double[] tmpColumVector = getSquareMatrixColumnVectorMultiplication(featureMatrix, newFeature);
    double tmpUCB = Math.sqrt(getRowVectorColumnVectorMultiplication(tmpColumVector, newFeature));
    double UCB = ConstantValues.ALPHA * tmpUCB;
    return UCB;
}
alpha is set to 0.3.
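Answer 0 (score: 1)
In each round, LinUCB should update the upper confidence bound of every arm based on its feature vector. I think you have implemented the algorithm incorrectly.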
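To make that concrete, here is a minimal sketch of how I understand the disjoint LinUCB loop: score every arm, pull the best one, then update only that arm with the observed reward. The class and method names (LinUcbSketch, score, select, update) are made up for illustration and are not from your code or the paper. I also assume a single context vector shared by all arms, and I keep the inverse of each arm's matrix up to date with a Sherman-Morrison rank-one update instead of calling an invert routine every round.

```java
/** Minimal disjoint LinUCB sketch. All names are hypothetical. */
class LinUcbSketch {
    final int d;              // feature dimension
    final double alpha;       // exploration weight
    final double[][][] aInv;  // per arm: inverse of A_a = I + sum of x x^T
    final double[][] b;       // per arm: sum of reward * x

    LinUcbSketch(int numArms, int d, double alpha) {
        this.d = d;
        this.alpha = alpha;
        this.aInv = new double[numArms][d][d];
        this.b = new double[numArms][d];
        for (int a = 0; a < numArms; a++) {
            for (int i = 0; i < d; i++) {
                aInv[a][i][i] = 1.0; // A_a starts as the identity, so its inverse does too
            }
        }
    }

    static double[] multiply(double[][] m, double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) {
                out[i] += m[i][j] * v[j];
            }
        }
        return out;
    }

    static double dot(double[] u, double[] v) {
        double sum = 0.0;
        for (int i = 0; i < u.length; i++) {
            sum += u[i] * v[i];
        }
        return sum;
    }

    /** theta^T x + alpha * sqrt(x^T A^{-1} x); note the inverse inside the square root. */
    double score(int arm, double[] x) {
        double[] theta = multiply(aInv[arm], b[arm]);
        double width = Math.sqrt(dot(x, multiply(aInv[arm], x)));
        return dot(theta, x) + alpha * width;
    }

    /** Scores every arm for this round; ties go to the lowest arm index. */
    int select(double[] x) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < b.length; a++) {
            double s = score(a, x);
            if (s > bestScore) {
                bestScore = s;
                best = a;
            }
        }
        return best;
    }

    /** A += x x^T (applied to the inverse via Sherman-Morrison) and b += reward * x. */
    void update(int arm, double[] x, double reward) {
        double[] u = multiply(aInv[arm], x); // A^{-1} x; A^{-1} is symmetric
        double denom = 1.0 + dot(x, u);
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < d; j++) {
                aInv[arm][i][j] -= u[i] * u[j] / denom;
            }
            b[arm][i] += reward * x[i];
        }
    }
}
```

With a reward of 0 the response vector b does not change, so a mean that was already 0 stays 0. The matrix still grows by x x^T, though, so the confidence width for that context shrinks and another arm should eventually score higher. In the posted code, calculateUCB receives arm.getFeatureMatrix() rather than its inverse, which may be part of why the same arm keeps winning.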