What is the output format of the userFeatures or productFeatures method of the ALS model in the MLlib library?

Asked: 2016-05-02 13:14:55

Tags: apache-spark rdd apache-spark-mllib

I have a ratings dataset like this: (userId, itemId, rating)

1 100 4
1 101 5
1 102 3
1 10 3
1 103 5
4 353 2
4 354 4
4 355 5
7 420 5
7 421 4
7 422 4

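I am trying to build a matrix factorization model with the ALS method, to obtain the user latent features and the product latent features, using this code: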

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsTest {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "C:\\spark-1.5.1-bin-hadoop2.6\\winutil")
    val conf = new SparkConf().setAppName("test").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Load and parse the data: each line is "userId itemId rating"
    val data = sc.textFile("ratings.txt")
    val ratings = data.map(_.split(" ") match { case Array(user, item, rate) =>
      Rating(user.toInt, item.toInt, rate.toDouble)
    })

    // Build the recommendation model using ALS
    val rank = 10
    val numIterations = 30
    val model = ALS.train(ratings, rank, numIterations, 0.01)

    // Print the product latent features
    model.productFeatures.cache().collect.foreach(println)
  }
}

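I set the rank to 10, so the output of model.productFeatures should be an RDD[(Int, Array[Double])]. But the output looks wrong to me. It contains some strange characters (what are they?), and the number of array elements seems to differ from record to record. These are supposed to be the latent feature values, so every record should contain the same number of them, exactly equal to the rank, yet there are not ten. The output looks like this: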

(48791,7fea9bb7)
(48795,284b451d)
(48799,3d64767d)
(48803,2f812fc3)
(48807,49d3ea7)
(48811,768cf084)
(48815,6845b7b6)
(48819,4e9c724a)
(48823,23191538)
(48827,3200d90f)
(48831,77bd30fe)
(48839,5a1e0261)
(48843,31c56ccf)
(48855,5b90359)
(48863,1b9de9d0)
(48867,313afdc8)
(48871,2b834c34)
(48875,666d21d6)
(48891,12ca97a2)
(48907,74f8fc8e)
(48911,452becc9)
(48915,4a47062b)
(48919,c76ef46)
(48923,3f596eca)
(48927,258e904c)
(48939,570abc88)
(48947,6c3d75f0)
(48951,18667983)
(48955,493b9633)
(48959,4b579d60)

In matrix factorization we should construct two matrices of smaller dimension whose product approximates the rating matrix:

rating matrix = p * q(transpose),
p = user latent feature matrix,
q = product latent feature matrix
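
Written out with explicit dimensions, the factorization could be stated as below. The symbols R, m, n and k are introduced here only for illustration and do not appear in the code: R is the rating matrix, m is the number of users, n the number of products, and k the rank (10 in the code above).

```latex
R \approx p \, q^{\top}, \qquad
p \in \mathbb{R}^{m \times k}, \quad
q \in \mathbb{R}^{n \times k}, \qquad
\hat{r}_{ui} = \sum_{f=1}^{k} p_{uf} \, q_{if}
```

Under this reading, each row of p is one user's vector of k latent features, each row of q is one product's vector of k latent features, and a predicted rating is the dot product of one row from each.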

Can anyone explain the output format of the ALS method in Spark?

1 Answer:

Answer 0: (score: 1)

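The strange characters in your output are not factor values. Calling println on an Array[Double] prints the JVM's default representation of the array object (a type tag and a hash code), not its elements. To see the latent factors of each product, use the following syntax: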

model.productFeatures.collect().foreach { case (productID, latentFactors) =>
  println("proID:" + productID + " factors:" + latentFactors.mkString(","))
}

For the given dataset, the result looks like this:

proID:1 factors:-1.262960433959961,-0.5678719282150269,1.5220979452133179,2.2127938270568848,-2.096022129058838,3.2418994903564453,0.9077783823013306,1.1294238567352295,-0.0628235936164856,-0.6788621544837952
proID:2 factors:-0.6275356411933899,-2.0269076824188232,1.735855221748352,3.7356512546539307,0.8256714344024658,1.5638374090194702,1.6725327968597412,-1.9434666633605957,0.868758499622345,0.18945524096488953
proID:3 factors:-1.262960433959961,-0.5678719282150269,1.5220979452133179,2.2127938270568848,-2.096022129058838,3.2418994903564453,0.9077783823013306,1.1294238567352295,-0.0628235936164856,-0.6788621544837952
proID:4 factors:-0.6275356411933899,-2.0269076824188232,1.735855221748352,3.7356512546539307,0.8256714344024658,1.5638374090194702,1.6725327968597412,-1.9434666633605957,0.868758499622345,0.18945524096488953

As you can see, each product has 10 factors, which is the correct number given the parameter val rank = 10.
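
The difference between the two ways of printing can be reproduced without Spark. The following is a minimal sketch in plain Scala. The object FactorFormatDemo, the helper formatRecord, and the factor values are made up for illustration and are not part of MLlib.

```scala
// Plain Scala, no Spark needed: compares println on an Array with mkString.
object FactorFormatDemo {
  // Hypothetical helper: renders one (id, factors) record as a readable line.
  def formatRecord(id: Int, factors: Array[Double]): String =
    "proID:" + id + " factors:" + factors.mkString(",")

  def main(args: Array[String]): Unit = {
    // Made-up record standing in for one element of productFeatures.
    val record = (1, Array(0.5, -1.25, 2.0))

    // The tuple's toString calls Array.toString, which yields a JVM type tag
    // ("[D" for an array of doubles) followed by "@" and a hash code.
    println(record)

    // mkString joins the actual elements of the array.
    println(formatRecord(record._1, record._2))

    // The array length is the number of latent factors in this record.
    println(record._2.length)
  }
}
```

With a trained model, the same length check could be applied to every record, for example by mapping each factor array to its length and confirming that only one distinct value (the rank) appears.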

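To answer your second question: after training the model you have access to two variables, userFeatures: RDD[(Int, Array[Double])] and productFeatures: RDD[(Int, Array[Double])]. Each entry of the user-item matrix is determined by the dot product of the corresponding user vector and product vector. For example, if you look at the source code of the predict method, you can see how these variables are used to predict a specific user's rating for a product: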

def predict(user: Int, product: Int): Double = {
  val userVector = userFeatures.lookup(user).head
  val productVector = productFeatures.lookup(product).head
  blas.ddot(rank, userVector, 1, productVector, 1)
}
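
The same computation can be sketched in plain Scala, without Spark or BLAS. This is an illustration only, not the MLlib implementation. The object PredictSketch, the maps users and products, and the factor values are hypothetical, and a hand-written dot product stands in for blas.ddot. The maps assume the feature RDDs were small enough to bring to the driver, for instance with collect().toMap.

```scala
// Plain Scala sketch of predicting one rating from two latent-factor vectors.
object PredictSketch {
  // Dot product of two equally long vectors; stands in for blas.ddot.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "both factor vectors must have the same rank")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Hypothetical local stand-ins for userFeatures and productFeatures.
  def predict(users: Map[Int, Array[Double]],
              products: Map[Int, Array[Double]],
              user: Int,
              product: Int): Double =
    dot(users(user), products(product))

  def main(args: Array[String]): Unit = {
    // Made-up factors with rank 2, keyed by user id and product id.
    val users = Map(1 -> Array(1.0, 2.0))
    val products = Map(100 -> Array(3.0, 0.5))

    // 1.0 * 3.0 + 2.0 * 0.5
    println(predict(users, products, 1, 100))
  }
}
```

Repeating this for every (user, product) pair fills in the whole user-item matrix, which is the product of the user feature matrix and the transposed product feature matrix described in the question.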