Pairing each RDD value with all other values in the same RDD in Scala

Time: 2017-03-31 03:12:15

Tags: scala apache-spark

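I am trying to pair each value in an RDD with all the other values of the same RDD, but I cannot find a suitable solution.

RDD: the image below shows the RDD data as pairs -> (UserId, MovieName::Rating).

[Image: data of the RDD -> (UserId, MovieName::Rating)]

I want to pair up each user's movie names and ratings as follows:

From the image above: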

  
      
  • User 1 rated Edison Kinetoscopic .. as 10 and La sortie ... as 10
  • User 2 rated The Arrival .. as 8, Le manoir .. as 7, Edison Kinetoscopic .. as 7, etc.

So the output should be:

**Key**: (Edison Kinetoscopic, La sortie des)  
**Value**: (10,10), (7,8)   -> since users 1 and 2 both rated these two movies  
**Key**: (The Arrival, Le manoir)  
**Value**: (8,7)    -> only user 2 rated these two movies

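Any help is appreciated.

1 Answer:

Answer 0: (score: -1)

If you are trying to build a recommender system or compute movie-to-movie similarity, there must be better ways to achieve this.

However, to solve your problem as stated, you can do the following: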

val rdd = sc.parallelize(List(
      (1,"Edison", 10),
      (1,"La sortie", 10),
      (2,"The Arrival", 8),
      (2,"Le manoir", 7),
      (2,"Edison", 7),
      (2,"La sortie", 8),
      (2,"Le voyage", 8),
      (2,"The Great", 7)
))

// first group user movies
val pairings = rdd.map{case (user,movie,rating) => (user, List((movie,rating)))}.reduceByKey(_++_)

// then get all pairs for each user
val allPairs = pairings.flatMap{case (user, movieRatings) => (1 until movieRatings.length).flatMap(i => movieRatings.zip(movieRatings drop i))}

// re-structure pairings into format we want
val finalPairing = allPairs.map { case ((m1, r1), (m2, r2)) =>
  // compareTo may return any negative or positive Int, so test its sign instead of matching on -1
  if (m1.compareTo(m2) < 0) ((m1, m2), List((r1, r2))) else ((m2, m1), List((r2, r1)))
}

// group by pairings
val groupByPair = finalPairing.reduceByKey(_++_)

// look at our pairings
groupByPair.take(100).foreach(println)

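compareTo is needed to guarantee that the two movies always appear in the same order within the tuple, so that identical pairs end up under the same key and can be grouped.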
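
The same idea can be tried out without a Spark cluster. The following is a minimal sketch on plain Scala collections, not the Spark code itself: it uses the standard `combinations(2)` method in place of the zip/drop step to enumerate each user's movie pairs, and it checks only the sign of `compareTo`, since `String.compareTo` can return any negative or positive integer (for example, `"Edison".compareTo("La sortie")` is -7). The names `MoviePairs`, `canonical`, and `byPair` are made up for this illustration.

```scala
// Plain Scala collections stand in for the RDD; the data is a subset of the sample above.
object MoviePairs {
  val ratings: List[(Int, String, Int)] = List(
    (1, "Edison", 10),
    (1, "La sortie", 10),
    (2, "The Arrival", 8),
    (2, "Le manoir", 7),
    (2, "Edison", 7),
    (2, "La sortie", 8)
  )

  // Put the lexicographically smaller title first so (a, b) and (b, a) share one key.
  def canonical(a: (String, Int), b: (String, Int)): ((String, String), (Int, Int)) =
    if (a._1.compareTo(b._1) < 0) ((a._1, b._1), (a._2, b._2))
    else ((b._1, a._1), (b._2, a._2))

  // For each user, take every unordered pair of rated movies, then group by the movie pair.
  val byPair: Map[(String, String), List[(Int, Int)]] =
    ratings
      .groupBy(_._1)
      .values
      .flatMap(_.map { case (_, movie, rating) => (movie, rating) }.combinations(2))
      .map { case List(a, b) => canonical(a, b) }
      .toList
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2) }
}
```

Here `MoviePairs.byPair(("Edison", "La sortie"))` holds the rating pairs (10,10) and (7,8), one per user, in no guaranteed order. Because of the canonical ordering, the pair that user 2 rated as The Arrival / Le manoir is stored under the key `("Le manoir", "The Arrival")` with the ratings swapped to match.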