I have an RDD of the following type, ((UserID, MovieID), 1):
val data_wo_header=dropheader(data).map(_.split(",")).map(x=>((x(0).toInt,x(1).toInt),1))
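I want to convert this data structure into a 2D array in which every (userID, movieID) pair present in the original RDD is 1 and every other cell is 0.
I think we have to map the user IDs to 0..N-1, where N is the number of distinct users, and the movie IDs to 0..M-1, where M is the number of distinct movies.
Edit: example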
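The ID-to-index mapping described above could look roughly like this. It is a minimal sketch on plain Scala collections rather than an RDD; the sample IDs and the names `pairs`, `userIndex` and `movieIndex` are illustrative assumptions, not part of the original code.

```scala
// Sample (userID, movieID) pairs, used here only for illustration
val pairs = Seq(
  (101, 1002), (101, 1003), (101, 1006),
  (102, 1002), (102, 1004),
  (103, 1002), (103, 1003), (103, 1007)
)

// Give every distinct ID a dense index, assigned in sorted ID order
val userIndex: Map[Int, Int] = pairs.map(_._1).distinct.sorted.zipWithIndex.toMap
val movieIndex: Map[Int, Int] = pairs.map(_._2).distinct.sorted.zipWithIndex.toMap
```

With these two maps, a pair (user, movie) lands in row `userIndex(user)` and column `movieIndex(movie)` of the matrix.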
Movie ID->
Userid 1 2 3 4 5 6 7
1 0 1 1 0 0 1 0
2 0 1 0 1 0 0 0
3 0 1 1 0 0 0 1
4 1 1 0 0 1 0 0
5 0 1 1 0 0 0 1
6 1 1 1 1 1 0 0
7 0 1 1 0 0 0 0
8 0 1 1 1 0 0 1
9 0 1 1 0 0 1 0
The RDD will be of the sort
(userID, movID,rating)
101,1002,3.5
101,1003,2.5
101,1006,3
102,1002,3.5
102,1004,4.0
103,1002,1.0
103,1003,1.0
103,1007,5.0
….
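Answer 0 (score: 0)
val baseRDD = sc.parallelize(Seq((101, 1002, 3.5), (101, 1003, 2.5), (101, 1006, 3.0), (102, 1002, 3.5), (102, 1004, 4.0), (103, 1002, 1.0), (103, 1003, 1.0), (103, 1007, 5.0)))
// collect() brings the groups to the driver, so the output is printed locally rather than on the executors
baseRDD.map(x => (x._1, x._2)).groupByKey().collect().foreach(println)
The input is in the (userID, movID, rating) format you described.
Result: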
(101,CompactBuffer(1002,1003,1006))
(102,CompactBuffer(1002,1004))
(103,CompactBuffer(1002,1003,1007))
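The grouped output still has to be turned into 0/1 rows to get the matrix asked for in the question. One possible way to do that is sketched below, using a plain Scala `Map` to stand in for the collected groups; the names `grouped`, `movieIds` and `rows` are assumptions made for this illustration.

```scala
// The grouped result shown above, held in a plain Map for illustration
val grouped: Map[Int, Seq[Int]] = Map(
  101 -> Seq(1002, 1003, 1006),
  102 -> Seq(1002, 1004),
  103 -> Seq(1002, 1003, 1007)
)

// Matrix columns: every distinct movie ID, in sorted order
val movieIds: Seq[Int] = grouped.values.flatten.toSeq.distinct.sorted

// One 0/1 row per user, ordered by user ID
val rows: Seq[(Int, Seq[Int])] = grouped.toSeq.sortBy(_._1).map { case (user, movies) =>
  val seen = movies.toSet
  (user, movieIds.map(m => if (seen(m)) 1 else 0))
}

rows.foreach { case (user, row) => println(s"$user ${row.mkString(" ")}") }
// prints:
// 101 1 1 0 1 0
// 102 1 0 1 0 0
// 103 1 1 0 0 1
```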
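Answer 1 (score: 0)
I managed to generate the 2D matrix with the following function. It takes an RDD of the form ((userID, movID), rating), built from data such as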
101,1002,3.5
101,1003,2.5
101,1006,3
102,1002,3.5
102,1004,4.0
103,1002,1.0
103,1003,1.0
103,1007,5.0
and returns the characteristic matrix:
import org.apache.spark.rdd.RDD

def generate_characteristic_matrix(data_wo_header: RDD[((Int, Int), Int)]): Array[Array[Int]] = {
  // Collect the distinct (userID, movieID) pairs to the driver once
  val pairs = data_wo_header.keys.distinct().collect()
  // Map user IDs to 0 until user_count and movie IDs to 0 until movie_count, in sorted ID order
  val user_index = pairs.map(_._1).distinct.sorted.zipWithIndex.toMap
  val movie_index = pairs.map(_._2).distinct.sorted.zipWithIndex.toMap
  // The characteristic matrix has size user_count x movie_count and starts out all zeros
  val char_matrix = Array.ofDim[Int](user_index.size, movie_index.size)
  // Set the cell of every (user, movie) pair that occurs in the data to 1
  pairs.foreach { case (user, movie) =>
    char_matrix(user_index(user))(movie_index(movie)) = 1
  }
  char_matrix
}
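To check the result against the table in the question, the returned array can be printed with its row and column labels. The helper below is a hypothetical addition, not part of the function above. It assumes the caller supplies the sorted user and movie IDs separately, since the function returns only the matrix.

```scala
// Hypothetical helper: render a 0/1 matrix as text, with movie IDs as the header row
def renderMatrix(matrix: Array[Array[Int]], userIds: Seq[Int], movieIds: Seq[Int]): String = {
  val header = ("Userid" +: movieIds.map(_.toString)).mkString(" ")
  val body = userIds.zip(matrix).map { case (user, row) =>
    (user.toString +: row.map(_.toString).toSeq).mkString(" ")
  }
  (header +: body).mkString("\n")
}

// Matrix for the sample data above, written out by hand for illustration
val sampleMatrix = Array(
  Array(1, 1, 0, 1, 0),
  Array(1, 0, 1, 0, 0),
  Array(1, 1, 0, 0, 1)
)
val rendered = renderMatrix(sampleMatrix, Seq(101, 102, 103), Seq(1002, 1003, 1004, 1006, 1007))
println(rendered)
```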