我是Scala和RDD的新手。 我有一个非常简单的场景,但似乎很难用RDD实现。
方案: 我有两张桌子。一大一小。我播放了较小的桌子。 然后我想加入表格,最后在连接后将值汇总到最终总数。
以下是代码示例:
val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))
val broadcastVar = sc.broadcast(smallRDD.keyBy{ a => (a._1,a._2) } // turn to pair RDD
.collectAsMap() // collect as Map
)
//first join
val joinedRDD = bigRDD.map( accs => {
//get list of groups
val groups = List("Fruit", "ZipCode")
val i = "Fruit"
//for each group
//for(i <- groups) {
if (broadcastVar.value.get(accs._1, i) != None) {
( broadcastVar.value.get(accs._1, i).get._1,
broadcastVar.value.get(accs._1, i).get._2,
accs._2, accs._3)
} else {
None
}
//}
}
)
//expected after this
//("A","Fruit","Apples",1, "1Jan2000"),("B","Fruit","Apples",2, "1Jan2000"),
//("A","ZipCode","1234", 1,"1Jan2000"),("B","ZipCode","456", 2,"1Jan2000")
//then group and sum
//cannot do anything with the joinedRDD!!!
//error == value copy is not a member of Product with Serializable
// Final Expected Result
//("Fruit","Apples",3, "1Jan2000"),("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")
我的问题:
提前感谢您的帮助!
case class dt (group:String, group_key:String, count:Long, date:String)
val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))
val broadcastVar = sc.broadcast(smallRDD.keyBy{ a => (a._1) } // turn to pair RDD
.groupByKey() //to not loose any data
.collectAsMap() // collect as Map
)
//first join
val joinedRDD = bigRDD.flatMap( accs => {
if (broadcastVar.value.get(accs._1) != None) {
val bc = broadcastVar.value.get(accs._1).get
bc.map(p => {
dt(p._2, p._3,accs._2, accs._3)
})
} else {
None
}
}
)
//expected after this
//("Fruit","Apples",1, "1Jan2000"),("Fruit","Apples",2, "1Jan2000"),
//("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")
//then group and sum
var finalRDD = joinedRDD.map(s => {
(s.copy(count=0),s.count) //trick to keep code to minimum (count = 0)
})
.reduceByKey(_ + _)
.map(pair => {
pair._1.copy(count=pair._2)
})
答案 0 :(得分:1)
在map语句中,根据if条件返回元组或None。这些类型不匹配,因此您退回常见的超类型,因此joinedRDD
是RDD[Product with Serializable]
这根本不是您想要的(它基本上是RDD[Any]
)。您需要确保所有路径返回相同的类型。在这种情况下,您可能需要Option[(String, String, Int, String)]
。您需要做的就是将元组结果包装到Some
if (broadcastVar.value.get(accs._1, i) != None) {
Some(( broadcastVar.value.get(accs._1, i).get.group_key,
broadcastVar.value.get(accs._1, i).get.group,
accs._2, accs._3))
} else {
None
}
现在您的类型将匹配。这将使joinedRDD
和RDD[Option(String, String, Int, String)]
成为可能。现在类型是正确的,数据是可用的,但是,这意味着您需要映射选项以使用元组。如果您在最终结果中不需要None
值,则可以使用flatmap
代替map
来创建joinedRDD
,这将为您打开选项,过滤掉所有None
s。
CollectAsMap
是将RDD转换为Hashmap的正确方法,但您需要为单个键设置多个值。在使用collectAsMap
之前,但在将smallRDD映射到Key,Value对之后,使用groupByKey
将单个键的所有值组合在一起。当您从HashMap中查找键时,可以映射值,为每个键创建一个新记录。