Scala broadcast join with a one-to-many relationship

Date: 2018-03-15 13:24:16

Tags: scala join broadcast

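I am new to Scala and RDDs. I have a very simple scenario, but it seems hard to implement with RDDs.

Scenario: I have two tables, one large and one small. I broadcast the smaller table. I then want to join the two tables and, after the join, sum the values into a final total.

Here is a code sample: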

val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))

val broadcastVar = sc.broadcast(smallRDD.keyBy{ a => (a._1,a._2) }  // turn to pair RDD
  .collectAsMap()               // collect as Map
)

//first join
val joinedRDD = bigRDD.map( accs => {
  //get list of groups
  val groups = List("Fruit", "ZipCode")
  val i = "Fruit"
  //for each group
  //for(i <- groups) {
  if (broadcastVar.value.get(accs._1, i) != None) {
    ( broadcastVar.value.get(accs._1, i).get._1,
      broadcastVar.value.get(accs._1, i).get._2,
      accs._2, accs._3)
  } else {
    None
  }
  //}
}
)
//expected after this
//("A","Fruit","Apples",1, "1Jan2000"),("B","Fruit","Apples",2, "1Jan2000"),
//("A","ZipCode","1234", 1,"1Jan2000"),("B","ZipCode","456", 2,"1Jan2000")

//then group and sum
//cannot do anything with the joinedRDD!!!
//error == value copy is not a member of Product with Serializable

// Final Expected Result
//("Fruit","Apples",3, "1Jan2000"),("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")

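My questions:

  • Is using RDDs the best approach here in the first place? Disclaimer: I have already completed this task successfully with DataFrames. The idea is to create another version that uses only RDDs, in order to compare performance.
  • Why is the type of my joinedRDD not recognized after it is created, so that I can go on to use functions such as copy?
  • How can I broadcast the variable without using .collectAsMap()? I currently have to include the first item in the key to enforce uniqueness without dropping any values.

Thanks in advance for your help!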

Final solution, for anyone interested

case class dt (group:String, group_key:String, count:Long, date:String)

val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))

val broadcastVar = sc.broadcast(smallRDD.keyBy{ a => (a._1) }  // turn to pair RDD
    .groupByKey()                 // so that no data is lost
    .collectAsMap()               // collect as Map
)

//first join
val joinedRDD = bigRDD.flatMap { accs =>
  // look up every small-table row for this key; keys with no match yield nothing
  broadcastVar.value.getOrElse(accs._1, Iterable.empty)
    .map(p => dt(p._2, p._3, accs._2, accs._3))
}
//expected after this
//("Fruit","Apples",1, "1Jan2000"),("Fruit","Apples",2, "1Jan2000"),
//("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")

//then group and sum
val finalRDD = joinedRDD.map(s => {
  (s.copy(count=0),s.count)  //trick to keep code to minimum (count = 0)
  })
  .reduceByKey(_ + _)
  .map(pair => {
    pair._1.copy(count=pair._2)
  })
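
The `copy(count = 0)` trick in the final step can be seen in isolation with plain Scala collections. This is only a minimal sketch: `Row` is a hypothetical stand-in for the `dt` case class, and `groupBy` followed by a sum stands in for `reduceByKey` on an RDD.

```scala
object SumByKeySketch {
  // Hypothetical record type mirroring the `dt` case class above.
  final case class Row(group: String, groupKey: String, count: Long, date: String)

  val joined = List(
    Row("Fruit", "Apples", 1L, "1Jan2000"),
    Row("Fruit", "Apples", 2L, "1Jan2000"),
    Row("ZipCode", "1234", 1L, "1Jan2000"),
    Row("ZipCode", "456", 2L, "1Jan2000")
  )

  // Zeroing `count` makes rows that differ only in count compare equal,
  // so the zeroed row can serve as the grouping key.
  val totals: Map[Row, Long] = joined
    .groupBy(_.copy(count = 0L))
    .map { case (key, rows) => key -> rows.map(_.count).sum }

  // Put each total back into its key to get the final records.
  val summed: Set[Row] = totals.map { case (key, total) => key.copy(count = total) }.toSet
}
```

Because case classes compare by value, no separate key type is needed; the cost is that the key carries a dummy `count` field.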

1 answer:

Answer 0 (score: 1)

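In your map statement you return either a tuple or None, depending on the if condition. These types do not match, so the compiler falls back to their common supertype, and joinedRDD becomes an RDD[Product with Serializable], which is not what you want at all (it is basically RDD[Any]). You need to make sure every path returns the same type. In this case you probably want Option[(String, String, Int, String)]. All you need to do is wrap the tuple result in Some: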

  if (broadcastVar.value.get(accs._1, i) != None) {
    // the broadcast values are tuples, so use positional accessors
    Some(( broadcastVar.value.get(accs._1, i).get._1,
      broadcastVar.value.get(accs._1, i).get._2,
      accs._2, accs._3))
  } else {
    None
  }

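Now your types will match. This makes joinedRDD an RDD[Option[(String, String, Int, String)]]. The types are now correct and the data is usable; however, it means you need to map over the Option to work with the tuples. If you do not need the None values in the final result, you can use flatMap instead of map to create joinedRDD, which unwraps the Options for you and filters out all the Nones.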
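
The difference between map and flatMap here can be sketched without Spark. The example below uses plain Scala collections as a stand-in for the RDDs; `lookup` and `big` are hypothetical sample data, and the choice of which tuple fields to keep is only for illustration.

```scala
object OptionJoinSketch {
  // Hypothetical stand-ins for the broadcast map and the big table.
  val lookup: Map[(String, String), (String, String, String)] = Map(
    ("A", "Fruit") -> ("A", "Fruit", "Apples"),
    ("B", "Fruit") -> ("B", "Fruit", "Apples")
  )
  val big = List(("A", 1, "1Jan2000"), ("B", 2, "1Jan2000"), ("C", 3, "1Jan2000"))

  // Every path yields an Option, so the element type is
  // Option[(String, String, Int, String)] rather than a vague supertype.
  val wrapped: List[Option[(String, String, Int, String)]] = big.map { accs =>
    lookup.get((accs._1, "Fruit")).map(hit => (hit._2, hit._3, accs._2, accs._3))
  }

  // flatMap unwraps each Some and drops each None.
  val flattened: List[(String, String, Int, String)] = big.flatMap { accs =>
    lookup.get((accs._1, "Fruit")).map(hit => (hit._2, hit._3, accs._2, accs._3))
  }
}
```

Calling `map` on the Option returned by `get` also avoids the separate `!= None` check and the repeated `.get` calls.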

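collectAsMap is the correct way to turn an RDD into a HashMap, but you need multiple values for a single key. Before calling collectAsMap, but after mapping smallRDD to key-value pairs, use groupByKey to group all the values for a single key together. Then, when you look up a key in the HashMap, you can map over its values, creating a new record for each one.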
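
As a rough illustration of that one-to-many lookup, here is a sketch using plain Scala collections. It assumes `groupBy` on a List as a local analogue of `keyBy(...).groupByKey().collectAsMap()`; the data and names are hypothetical.

```scala
object GroupedLookupSketch {
  val small = List(
    ("A", "Fruit", "Apples"), ("A", "ZipCode", "1234"),
    ("B", "Fruit", "Apples"), ("B", "ZipCode", "456")
  )
  val big = List(("A", 1, "1Jan2000"), ("B", 2, "1Jan2000"), ("C", 3, "1Jan2000"))

  // One key now maps to all of its rows, so nothing is overwritten.
  val grouped: Map[String, List[(String, String, String)]] = small.groupBy(_._1)

  // Emit one output record per matching small-table row;
  // keys without a match (such as "C") contribute nothing.
  val joined: List[(String, String, Int, String)] = big.flatMap { accs =>
    grouped.getOrElse(accs._1, Nil).map(p => (p._2, p._3, accs._2, accs._3))
  }
}
```

With the values grouped per key, the key no longer needs the group name in it to stay unique.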