Combining two RDDs

Date: 2017-04-11 12:18:20

Tags: scala apache-spark

I am new to Spark. Could someone help me find a way to combine two RDDs into a final RDD, following the logic below, in Scala, preferably without using sqlContext (DataFrames) -

RDD1 = column1, column2, column3, with 362825 records

RDD2 = column2_distinct (same as column2 in RDD1, but holding distinct values), column4, with 2621 records

Final RDD = column1, column2, column3, column4

Example - RDD1 =

userid | progid | rating

       a       001     5
       b       001     3
       b       002     4
       c       003     2

RDD2 =

progid (distinct) | id

   001                  1
   002                  2
   003                  3

Final RDD =

userid | progid | id | rating

        a       001      1   5
        b       001      1   3
        b       002      2   4
        c       003      3   2

val rawRdd1 = pairrdd1.map(x => x._1.split(",")(0) + "," + x._1.split(",")(1) + "," + x._2) //362825 records

val rawRdd2 = pairrdd2.map(x => x._1 + "," + x._2) //2621 records

val schemaString1 = "userid programid rating"

val schemaString2 = "programid id"

val fields1 = schemaString1.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))

val fields2 = schemaString2.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))

val schema1 = StructType(fields1)

val schema2 = StructType(fields2)

val rowRDD1 = rawRdd1.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1), attributes(2)))

val rowRDD2 = rawRdd2.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1)))

val DF1 = sparkSession.createDataFrame(rowRDD1, schema1)

val DF2 = sparkSession.createDataFrame(rowRDD2, schema2)

DF1.createOrReplaceTempView("df1")

DF2.createOrReplaceTempView("df2")

val resultDf = DF1.join(DF2, Seq("programid"))

val DF3 = sparkSession.sql("""SELECT df1.userid, df1.programid, df2.id, df1.rating FROM df1 JOIN df2 on df1.programid == df2.programid""")

println(DF1.count()) //362825 records

println(DF2.count()) //2621 records

println(DF3.count()) // only 297 records - expected the same number of records as DF1, with a new column (id) from DF2 attached, holding the value corresponding to each programid

2 Answers:

Answer 0 (score: 0)

This is a bit ugly but should work (Spark 2.0):

 val rdd1 = sparkSession.sparkContext.parallelize(List("a,001,5", "b,001,3", "b,002,4","c,003,2"))
 val rdd2 = sparkSession.sparkContext.parallelize(List("001,1", "002,2", "003,3"))

 val groupedRDD1 = rdd1.map(x => (x.split(",")(1),x))
 val groupedRDD2 = rdd2.map(x => (x.split(",")(0),x))
 val joinRDD = groupedRDD1.join(groupedRDD2)
 // convert back to String
 val cleanJoinRDD = joinRDD.map(x => x._1 + "," + x._2._1.replace(x._1 + ",","") + "," + x._2._2.replace(x._1 + ",",""))
 cleanJoinRDD.collect().foreach(println)
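
For reference, the same reassembly can be sketched without string replace, assuming the comma-separated line shapes shown above; this variant also produces the userid, progid, id, rating order the question asks for (cleanJoinRDD2 is just an illustrative name):

 val cleanJoinRDD2 = joinRDD.map { case (progid, (left, right)) =>
   val Array(userid, _, rating) = left.split(",")  // left is a full RDD1 line, e.g. "a,001,5"
   val Array(_, id) = right.split(",")             // right is a full RDD2 line, e.g. "001,1"
   s"$userid,$progid,$id,$rating"
 }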

I think a better option would be to use Spark SQL.
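
For illustration, a minimal sketch of what that could look like, assuming a SparkSession named sparkSession is in scope (as in the question) and using the small example data from above:

 import sparkSession.implicits._

 val df1 = Seq(("a", "001", 5), ("b", "001", 3), ("b", "002", 4), ("c", "003", 2))
   .toDF("userid", "programid", "rating")
 val df2 = Seq(("001", 1), ("002", 2), ("003", 3)).toDF("programid", "id")

 // equi-join on programid, then reorder the columns as in the expected output
 df1.join(df2, Seq("programid"))
   .select("userid", "programid", "id", "rating")
   .show()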

Answer 1 (score: 0)

First of all, why split the line, concatenate it, and then split it again? You could do it in one step:

val rowRdd1 = pairrdd1.map { x =>
    val Array(userid, progid) = x._1.split(",")
    val rating = x._2
    Row(userid, progid, rating)
}

My guess is that the problem is that your keys contain some extra characters, so they do not match in the join. A simple way to check is to perform a left join and look at the rows that did not match.
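
A minimal sketch of that check, reusing the DF1 and DF2 DataFrames from the question (the same idea works with leftOuterJoin on pair RDDs): after a left outer join on programid, the rows where id comes back null are the ones whose keys did not match.

import org.apache.spark.sql.functions.col

val unmatched = DF1.join(DF2, Seq("programid"), "left_outer")
  .filter(col("id").isNull)

println(unmatched.count())           // how many rows failed to match
unmatched.show(10, truncate = false) // inspect a few offending keys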

It could be something as simple as extra whitespace in the lines, which you can fix for both RDDs:

val rowRdd1 = pairrdd1.map { x =>
    val Array(userid, progid) = x._1.split(",").map(_.trim)
    val rating = x._2
    Row(userid, progid, rating)
}
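
Under the same assumption about pairrdd2 (keyed by programid with the id as value, as the rawRdd2 line in the question suggests), the matching fix for the second RDD would look like this sketch:

val rowRdd2 = pairrdd2.map { x =>
    val progid = x._1.toString.trim
    val id = x._2.toString.trim
    Row(progid, id)
}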