I am trying to use Spark MLlib ALS with implicit feedback for collaborative filtering. The input data has only two fields, userId and productId. I have no product ratings, just information about which products users have bought, and that's all. So to train ALS I use:
def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel
(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)
This API requires a Rating object:
Rating(user: Int, product: Int, rating: Double)
On the other hand, the documentation for trainImplicit says: train a matrix factorization model given an RDD of 'implicit preferences' of users for some products, in the form (userID, productID, preference).
When I set the rating/preference to 1, as in:
val ratings = sc.textFile(new File(dir, file).toString).map { line =>
val fields = line.split(",")
// format: (randomNumber, Rating(userId, productId, rating))
(rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}
val training = ratings.filter(x => x._1 < 60)
.values
.repartition(numPartitions)
.cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
.values
.repartition(numPartitions)
.cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()
Then I train ALS:
val model = ALS.trainImplicit(ratings, rank, numIter)
I get an RMSE of 0.9, which is a large error given that the preferences take values of 0 or 1:
val validationRmse = computeRmse(model, validation, numValidation)
/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
.join(data.map(x => ((x.user, x.product), x.rating)))
.values
math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}
So my question is: what value should I set for rating in
Rating(user: Int, product: Int, rating: Double)
for implicit training (in the ALS.trainImplicit method)?
Update
Using:
val alpha = 40
val lambda = 0.01
I get:
Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.
I guess this is still a large error. I also get a strange baseline improvement, where the baseline model is simply the mean (1).
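The -Infinity baseline improvement follows directly from setting every preference to 1.0: the mean-rating baseline then predicts 1.0 exactly, its RMSE is 0, and the relative improvement divides by zero. A minimal plain-Scala sketch with a hypothetical three-element test set:

```scala
// All preferences were set to 1.0, so the mean-rating baseline
// predicts exactly 1.0 for every (user, product) pair.
val testRatings = Seq(1.0, 1.0, 1.0)               // hypothetical test set
val mean = testRatings.sum / testRatings.size      // baseline prediction: 1.0
val baselineRmse = math.sqrt(
  testRatings.map(r => (r - mean) * (r - mean)).sum / testRatings.size)
// baselineRmse is 0.0, so (baselineRmse - testRmse) / baselineRmse
// diverges and the reported "improvement" becomes -Infinity.
```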
Answer 0 (score: 2)
You can specify the alpha confidence level. The default is 1.0, but try lowering it:
val alpha = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, alpha)
Let us know how it goes.
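For reference, Spark 1.x also exposes a trainImplicit overload that takes lambda and alpha explicitly, so both can be tuned together. A sketch, assuming `ratings`, `rank`, and `numIterations` from the question's code (it needs a running SparkContext):

```scala
import org.apache.spark.mllib.recommendation.ALS

// Sketch: tune regularization and confidence scaling together.
// `ratings`, `rank`, and `numIterations` are assumed from the question.
val lambda = 0.01  // regularization strength
val alpha  = 0.01  // confidence scaling for implicit feedback
val model  = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)
```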
Answer 1 (score: 1)
According to http://apache-spark-user-list.1001560.n3.nabble.com/ALS-implicit-error-pyspark-td16595.html, 'rating' can be a value > 1.
According to https://docs.prediction.io/templates/recommendation/training-with-implicit-preference/, 'rating' can be the number of observations for a given user + item.
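Following the second link's suggestion, the purchase log can be aggregated so that the rating becomes the number of times a user bought a product. A plain-Scala sketch with a made-up purchase list (with the question's RDD this would be a map to ((user, product), 1) followed by reduceByKey(_ + _)):

```scala
// Hypothetical purchase log: one (userId, productId) pair per purchase.
val purchases = Seq((1, 10), (1, 10), (1, 11), (2, 10), (2, 10), (2, 10))

// Count observations per (user, product); this count is the implicit
// "rating" (a confidence weight), not an explicit preference score.
val counts: Map[(Int, Int), Int] =
  purchases.groupBy(identity).map { case (pair, hits) => (pair, hits.size) }
// e.g. user 1 bought product 10 twice, user 2 bought product 10 three times
```

Values greater than 1 are fine here, since trainImplicit interprets the rating as a confidence in the observation, scaled by alpha.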
Answer 2 (score: 0)
It is even stranger that you are training on the whole set instead of using the "training" subset.
What does the distribution of your original data look like? Do you have many items with no preference, or a few heavily preferred items?
In "Collaborative Filtering for Implicit Feedback Datasets" the alpha used was 40; you may want to try different values, but