For a bit of context, what I'm trying to achieve here is to group multiple rows by a specific set of keys into counted buckets, and then merge those by date into regular rows, pre-computing a counter for each group. Just reading that may not be very clear, so here is a sample output (quite simple, nothing complicated).
(("Volvo", "T4", "2019-05-01"), 5)
(("Volvo", "T5", "2019-05-01"), 7)
(("Audi", "RS6", "2019-05-01"), 4)
and then merge those Row objects into...
date , volvo_counter , audi_counter
"2019-05-01" , 12 , 4
I think this is a fairly extreme corner case and there may well be a different approach, but I'd like to know whether there is a solution within the same RDD, so that several RDDs split by counter aren't needed.
Answer 0 (score: 2)
What you want to do is a pivot. You are talking about RDDs, so I assume your question is: "how do I do a pivot with the RDD API?". As far as I know, there is no built-in function in the RDD API that does it. You could do it yourself like this:
// let's create sample data
val rdd = sc.parallelize(Seq(
  (("Volvo", "T4", "2019-05-01"), 5),
  (("Volvo", "T5", "2019-05-01"), 7),
  (("Audi", "RS6", "2019-05-01"), 4)
))
// If the keys are not known in advance, we compute their distinct values
val values = rdd.map(_._1._1).distinct.collect.toSeq
// values: Seq[String] = WrappedArray(Volvo, Audi)
// Finally we make the pivot and use reduceByKey on the sequence
val res = rdd
  .map{ case ((make, model, date), counter) =>
    date -> values.map(v => if(make == v) counter else 0)
  }
  .reduceByKey((a, b) => a.indices.map(i => a(i) + b(i)))
// which gives you this
res.collect.head
// (String, Seq[Int]) = (2019-05-01,Vector(12, 4))
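If you want the counters labelled with the make they belong to rather than a bare Vector, you can zip the result with the values sequence computed above. A minimal sketch building on res and values from the snippet above:
// attach each counter to its make (order matches `values`)
val labelled = res.mapValues(counts => values.zip(counts).toMap)
labelled.collect.head
// roughly: (2019-05-01, Map(Volvo -> 12, Audi -> 4))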
Note that you can write simpler code using the SparkSQL API:
// let's first transform the previously created RDD to a dataframe
// (a SparkSession named `spark` is assumed to be in scope, as in spark-shell)
import spark.implicits._
import org.apache.spark.sql.functions.sum

val df = rdd.map{ case ((a, b, c), d) => (a, b, c, d) }
  .toDF("make", "model", "date", "counter")
// And then it's as simple as that:
df.groupBy("date")
  .pivot("make")
  .agg(sum("counter"))
  .show
+----------+----+-----+
| date|Audi|Volvo|
+----------+----+-----+
|2019-05-01| 4| 12|
+----------+----+-----+
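If the distinct makes are known up front, you can pass them to pivot explicitly and skip the extra pass that computes them; a small sketch reusing the df defined above:
df.groupBy("date")
  .pivot("make", Seq("Volvo", "Audi"))
  .agg(sum("counter"))
  .show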
Answer 1 (score: 1)
I think it's easier to do with a DataFrame:
// The answer uses Record and Key case classes without showing them;
// a plausible definition consistent with how they are used below:
case class Key(model: String, date: String)
case class Record(key: Key, value: Int)

import spark.implicits._
import org.apache.spark.sql.functions.{sum, when}

val data = Seq(
  Record(Key("Volvo", "2019-05-01"), 5),
  Record(Key("Volvo", "2019-05-01"), 7),
  Record(Key("Audi", "2019-05-01"), 4)
)
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF()

// one aggregation expression per distinct model, e.g. Volvo_counter, Audi_counter
val modelsExpr = df
  .select($"key.model".as("model"))
  .distinct()
  .collect()
  .map(r => r.getAs[String]("model"))
  .map(m => sum(when($"key.model" === m, $"value").otherwise(0)).as(s"${m}_counter"))

df
  .groupBy("key.date")
  .agg(modelsExpr.head, modelsExpr.tail: _*)
  .show(false)
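With the sample data above, this should print something along these lines (exact column naming and order may vary depending on how the distinct models are collected):
+----------+-------------+------------+
|date      |Volvo_counter|Audi_counter|
+----------+-------------+------------+
|2019-05-01|12           |4           |
+----------+-------------+------------+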