Creating a 2D matrix from an RDD

Date: 2017-03-28 16:25:31

Tags: apache-spark matrix rdd

I have an RDD of the following type, ((UserID, MovieID), 1):

val data_wo_header=dropheader(data).map(_.split(",")).map(x=>((x(0).toInt,x(1).toInt),1))

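I want to convert this data structure into a 2D array in which every (userID, movieID) pair present in the original RDD is 1 and every other entry is 0.

I think we have to map the user IDs to 0..N-1, where N is the number of distinct users, and the movie IDs to 0..M-1, where M is the number of distinct movies.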
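The ID-to-index mapping described above could be sketched as follows. This is a minimal illustration on plain Scala collections, not Spark code; the `pairs` sample and the names `userIndex` and `movieIndex` are made up for the example.

```scala
// Hypothetical sample of (userID, movieID) pairs, for illustration only.
val pairs = Seq((101, 1002), (101, 1003), (102, 1002), (103, 1007))

// Pairing the sorted distinct IDs with their position gives each ID a 0-based index.
val userIndex: Map[Int, Int] = pairs.map(_._1).distinct.sorted.zipWithIndex.toMap
val movieIndex: Map[Int, Int] = pairs.map(_._2).distinct.sorted.zipWithIndex.toMap

println(userIndex(102))   // prints 1
println(movieIndex(1007)) // prints 2
```

With such maps, the row and column of any (userID, movieID) pair can be looked up in constant time.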

Edit: example

        Movie ID->

Userid  1 2 3 4 5 6 7 

1       0 1 1 0 0 1 0 
2       0 1 0 1 0 0 0 
3       0 1 1 0 0 0 1 
4       1 1 0 0 1 0 0 
5       0 1 1 0 0 0 1 
6       1 1 1 1 1 0 0 
7       0 1 1 0 0 0 0 
8       0 1 1 1 0 0 1 
9       0 1 1 0 0 1 0 

The RDD will be of the form
(userID, movID, rating)
101,1002,3.5
101,1003,2.5
101,1006,3
102,1002,3.5
102,1004,4.0
103,1002,1.0
103,1003,1.0
103,1007,5.0
….

2 answers:

Answer 0: (score: 0)

val baseRDD = sc.parallelize(Seq((101, 1002, 3.5), (101, 1003, 2.5), (101, 1006, 3.0), (102, 1002, 3.5), (102, 1004, 4.0), (103, 1002, 1.0), (103, 1003, 1.0), (103, 1007, 5.0)))
// Keep only (userID, movID), group the movie IDs by user, and print on the driver
baseRDD.map(x => (x._1, x._2)).groupByKey().collect().foreach(println)

The data is in the (userID, movID, rating) format, as you described.

Result:

(101,CompactBuffer(1002,1003,1006))

(102,CompactBuffer(1002,1004))

(103,CompactBuffer(1002,1003,1007))
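The grouped output above is not yet a 0/1 matrix. One way to finish the conversion might look like the sketch below. It assumes the grouped result is small enough to collect to the driver, and it writes that result as a plain Scala `Map` for illustration; `grouped`, `movieIds` and `rows` are names introduced here, not part of the answer.

```scala
// The grouped output from above, written as a plain Scala Map for illustration.
val grouped = Map(
  101 -> Seq(1002, 1003, 1006),
  102 -> Seq(1002, 1004),
  103 -> Seq(1002, 1003, 1007)
)

// Column order: all distinct movie IDs, sorted ascending.
val movieIds = grouped.values.flatten.toSeq.distinct.sorted

// One 0/1 row per user, in ascending user ID order.
val rows = grouped.toSeq.sortBy(_._1).map { case (user, movies) =>
  val seen = movies.toSet
  user -> movieIds.map(m => if (seen(m)) 1 else 0)
}

rows.foreach { case (user, row) => println(s"$user ${row.mkString(" ")}") }
```

Each row has one column per distinct movie ID, so the row for user 102 has a 1 only in the columns for movies 1002 and 1004.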

Answer 1: (score: 0)

I managed to generate the 2D matrix with the following function. It takes an RDD of the form

((userID, movID), rating)
101,1002,3.5
101,1003,2.5
101,1006,3
102,1002,3.5
102,1004,4.0
103,1002,1.0
103,1003,1.0
103,1007,5.0

and returns the characteristic matrix:

import org.apache.spark.rdd.RDD

def generate_characteristic_matrix(data_wo_header: RDD[((Int, Int), Int)]): Array[Array[Int]] = {
    // Collect the distinct (userID, movieID) pairs to the driver once;
    // this assumes the data fits in driver memory
    val pairs = data_wo_header.keys.distinct().collect()

    // Map the sorted distinct user IDs to 0..user_count-1
    val user_index = pairs.map(_._1).distinct.sorted.zipWithIndex.toMap
    // Map the sorted distinct movie IDs to 0..movie_count-1
    val movie_index = pairs.map(_._2).distinct.sorted.zipWithIndex.toMap

    // The characteristic matrix is user_count x movie_count, initialised to 0
    val char_matrix = Array.ofDim[Int](user_index.size, movie_index.size)
    pairs.foreach { case (user, movie) =>
      char_matrix(user_index(user))(movie_index(movie)) = 1
    }
    char_matrix
  }