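Combining two RDDs when some values are missing from one of them

Time: 2017-03-10 12:02:12

Tags: scala apache-spark rdd

Hi, I have two RDDs that I want to merge into one. The first RDD has the format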

//((UserID,MovID),Rating)
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
  ((user, mov), rate)
}

I have another RDD:

//((UserID,MovID),"N/A")
val user_mov_rat=user_mov.map(x=>(x,"N/A"))

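The second RDD has more keys than the first, but the two overlap. I need to combine the RDDs so that only those keys of the second RDD that are not already present in RDD1 are appended to RDD1.

2 answers:

Answer 0: (score: 0)

You can do something like this: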

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Setting up the rdds as described in the question
case class UserRating(user: String, mov: String, rate: Int = -1)

val list1 = List(UserRating("U1", "M1", 1),UserRating("U2", "M2", 3),UserRating("U3", "M1", 3),UserRating("U3", "M2", 1),UserRating("U4", "M2", 2))

val list2 = List(UserRating("U1", "M1"),UserRating("U5", "M4", 3),UserRating("U6", "M6"),UserRating("U3", "M2"), UserRating("U4", "M2"), UserRating("U4", "M3", 5))

val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)

// Convert to DataFrames so they are easier to handle.
// (toDF works out of the box in spark-shell; elsewhere it needs `import spark.implicits._`)
val df1 = rdd1.toDF
val df2 = rdd2.toDF

// What we got:
df1.show
+----+---+----+
|user|mov|rate|
+----+---+----+
|  U1| M1|   1|
|  U2| M2|   3|
|  U3| M1|   3|
|  U3| M2|   1|
|  U4| M2|   2|
+----+---+----+

df2.show
+----+---+----+
|user|mov|rate|
+----+---+----+
|  U1| M1|  -1|
|  U5| M4|   3|
|  U6| M6|  -1|
|  U3| M2|  -1|
|  U4| M2|  -1|
|  U4| M3|   5|
+----+---+----+

// Figure out the extra reviews in second dataframe that do not match (user, mov) in first    
val xtraReviews = df2.join(df1.withColumnRenamed("rate", "rate1"), Seq("user", "mov"), "left_outer").where("rate1 is null")

// Union them. Be careful: union matches columns by position, not by name. See
// http://stackoverflow.com/questions/32705056/what-is-going-wrong-with-unionall-of-spark-dataframe

def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
    // Keep only the columns both frames share, in a's column order,
    // so that both sides of the union line up
    val columns = a.columns.filter(b.columns.contains).map(col).toSeq
    a.select(columns: _*).union(b.select(columns: _*))
}

// Final result of combining only unique values in df2    
unionByName(df1, xtraReviews).show

+----+---+----+
|user|mov|rate|
+----+---+----+
|  U1| M1|   1|
|  U2| M2|   3|
|  U3| M1|   3|
|  U3| M2|   1|
|  U4| M2|   2|
|  U5| M4|   3|
|  U4| M3|   5|
|  U6| M6|  -1|
+----+---+----+
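
If you would rather stay with pair RDDs, as in the question, the same idea can be sketched with subtractByKey, which keeps only the pairs whose key is absent from the other RDD. This is a minimal sketch, not part of the answer above: the sample data and the name userMovRat are made up for illustration, and it assumes it is run script-style (for example pasted into spark-shell) with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session in spark-shell, or starts a local one otherwise
val spark = SparkSession.builder().master("local[*]").appName("combine-rdds").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample data standing in for the question's RDDs
// ((UserID, MovID), Rating)
val predictions = sc.parallelize(Seq(((1, 10), 4.5), ((2, 20), 3.0)))
// ((UserID, MovID), "N/A")
val userMovRat = sc.parallelize(Seq((1, 10), (2, 20), (3, 30))).map(x => (x, "N/A"))

// Keep only the pairs of userMovRat whose key does not occur in predictions
val missing = userMovRat.subtractByKey(predictions)

// The value types differ (Double vs String), so ratings are rendered
// as strings before the union
val combined = predictions.mapValues(_.toString).union(missing)

combined.collect().foreach(println)
```

Rendering the ratings as strings is only one way to make the two value types agree. Mapping both sides to Option[Double], with None for the missing ratings, would keep the numbers usable downstream.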

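Answer 1: (score: 0)

You could also do it this way:

  1. RDD operations are often slower than DataFrame operations, so read your data into DataFrames or convert it to DataFrames.
  2. Use dropDuplicates() on the DataFrames, e.g. df.dropDuplicates(Seq("Key1", "Key2")), to get the distinct values of the keys in both DataFrames, and then
  3. simply union them: df1.union(df2).
  4. The benefit is that you are doing it the Spark way, so you get all of its parallelism and speed.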
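
The steps above can be sketched as follows. Note that dropDuplicates only removes repeated keys within a single DataFrame, so this sketch adds a left_anti join to drop the keys of the second DataFrame that already exist in the first. That join is an addition, not something this answer mentions. The sample data and column names are hypothetical, and the code assumes script-style execution with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the running session in spark-shell, or starts a local one otherwise
val spark = SparkSession.builder().master("local[*]").appName("combine-dfs").getOrCreate()
import spark.implicits._

// Hypothetical sample data; ratings are strings so both frames share a schema
val df1 = Seq(("U1", "M1", "1"), ("U2", "M2", "3")).toDF("user", "mov", "rate")
val df2 = Seq(("U1", "M1", "N/A"), ("U3", "M3", "N/A"), ("U3", "M3", "N/A"))
  .toDF("user", "mov", "rate")

val keys = Seq("user", "mov")

// Step 2: one row per key within each DataFrame
val d1 = df1.dropDuplicates(keys)
val d2 = df2.dropDuplicates(keys)

// Added step (assumption, not in the answer): keep only the rows of d2
// whose key has no match in d1
val onlyInDf2 = d2.join(d1, keys, "left_anti")

// Step 3: union matches columns by position; both sides are (user, mov, rate)
val combined = d1.union(onlyInDf2)

combined.show()
```

Doing the union first and calling dropDuplicates on the keys afterwards would also leave one row per key, but dropDuplicates does not let you choose which of the duplicate rows survives, so a real rating could be replaced by "N/A".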