Question

Hi, I have two RDDs that I want to merge into one. The first RDD has the format

//((UserID,MovID),Rating)
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
  ((user, mov), rate)
}

I have another RDD:

//((UserID,MovID),"N/A")
val user_mov_rat=user_mov.map(x=>(x,"N/A"))

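So the second RDD has more keys than the first, but they overlap with the keys in RDD1. I need to combine the RDDs so that only those keys of the second RDD that do not exist in RDD1 are appended to RDD1.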

Answer 1

You can do it like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Setting up the rdds as described in the question
case class UserRating(user: String, mov: String, rate: Int = -1)

val list1 = List(UserRating("U1", "M1", 1),UserRating("U2", "M2", 3),UserRating("U3", "M1", 3),UserRating("U3", "M2", 1),UserRating("U4", "M2", 2))

val list2 = List(UserRating("U1", "M1"),UserRating("U5", "M4", 3),UserRating("U6", "M6"),UserRating("U3", "M2"), UserRating("U4", "M2"), UserRating("U4", "M3", 5))

val rdd1 = sc.parallelize(list1)
val rdd2 = sc.parallelize(list2)

// Convert to Dataframe so it is easier to handle    
val df1 = rdd1.toDF
val df2 = rdd2.toDF

// What we got:
df1.show
+----+---+----+
|user|mov|rate|
+----+---+----+
|  U1| M1|   1|
|  U2| M2|   3|
|  U3| M1|   3|
|  U3| M2|   1|
|  U4| M2|   2|
+----+---+----+

df2.show
+----+---+----+
|user|mov|rate|
+----+---+----+
|  U1| M1|  -1|
|  U5| M4|   3|
|  U6| M6|  -1|
|  U3| M2|  -1|
|  U4| M2|  -1|
|  U4| M3|   5|
+----+---+----+

// Figure out the extra reviews in second dataframe that do not match (user, mov) in first    
val xtraReviews = df2.join(df1.withColumnRenamed("rate", "rate1"), Seq("user", "mov"), "left_outer").where("rate1 is null")

// Union them. Be careful because of this: http://stackoverflow.com/questions/32705056/what-is-going-wrong-with-unionall-of-spark-dataframe

def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
    val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
    a.select(columns: _*).union(b.select(columns: _*))
}

// Final result of combining only unique values in df2    
unionByName(df1, xtraReviews).show

+----+---+----+
|user|mov|rate|
+----+---+----+
|  U1| M1|   1|
|  U2| M2|   3|
|  U3| M1|   3|
|  U3| M2|   1|
|  U4| M2|   2|
|  U5| M4|   3|
|  U4| M3|   5|
|  U6| M6|  -1|
+----+---+----+
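
If you would rather stay with the pair RDDs from the question, the same idea can probably be expressed with subtractByKey followed by union. This is only a sketch under assumptions: the sample keys and ratings are made up, the names missing and combined are mine, and the ratings are converted to String so that both RDDs share one value type.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell `spark` and `sc` already exist
val spark = SparkSession.builder().master("local[*]").appName("combine-rdds").getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-ins for the question's RDDs: ((UserID, MovID), value)
val predictions = sc.parallelize(Seq(((1, 10), 4.5), ((2, 20), 3.0)))
val userMovRat  = sc.parallelize(Seq(((1, 10), "N/A"), ((3, 30), "N/A")))

// Keep only the entries of the second RDD whose key is absent from the first
val missing = userMovRat.subtractByKey(predictions)

// Convert ratings to String so both sides have the same element type, then append
val combined = predictions.mapValues(_.toString).union(missing)

combined.collect().sortBy(_._1).foreach(println)
```

Here the key (1,10) keeps its predicted rating, and only (3,30) is taken from the second RDD with its "N/A" placeholder.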

Answer 2

You can also do it like this:

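RDDs are very slow, so read your data into a DataFrame or convert it to one.
Use Spark's dropDuplicates() on the DataFrames, e.g. df.dropDuplicates(['Key1', 'Key2']), to get the distinct key values in both DataFrames, and then
simply union them with df1.union(df2).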

The advantage is that you do it the Spark way, so you get all of its parallelism and speed.
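
In Scala, one way to read this suggestion is to union first and then drop duplicates on the key columns. Treat the following as a rough sketch, not a definitive recipe: the column names user, mov and rate and the sample rows are assumptions borrowed from Answer 1's setup. Note that dropDuplicates does not promise which row survives for a duplicated key, so if the rating from df1 must win, an anti-join style filter like the join in Answer 1 is the safer choice.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("union-dedup").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the same column layout in both frames
val df1 = Seq(("U1", "M1", 1), ("U2", "M2", 3)).toDF("user", "mov", "rate")
val df2 = Seq(("U1", "M1", -1), ("U6", "M6", -1)).toDF("user", "mov", "rate")

// union matches columns by position, so both frames must share the same column order.
// dropDuplicates then keeps one (unspecified) row per (user, mov) key.
val combined = df1.union(df2).dropDuplicates("user", "mov")

combined.show()
```

The result has one row per (user, mov) pair; for the overlapping key (U1, M1) the surviving rate may come from either frame.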

Combining RDDs when some values are missing