I have a DataFrame and I want to compute the sum of adjacent rows. I used a window function, but I found that when I use it, all of the data gets collected into a single partition. How can I compute the adjacent sums while keeping the DataFrame's data spread across multiple partitions? Here is my code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val arr = Array(1, 7, 3, 3, 5, 21, 7, 3, 9, 10)
var df = sc.parallelize(arr, 5).toDF("value")
// first() over the (-1, 0) frame picks up the previous row's value.
val w = Window.rowsBetween(-1, 0)
df = df.withColumn("nextValue", first(col("value")).over(w))
  .withColumn("sum", col("value") + col("nextValue"))
println(df.rdd.getNumPartitions)
df.show()
// Get the number of rows in each partition.
df.rdd.mapPartitionsWithIndex { (partIdx, iter) =>
  val part_map = scala.collection.mutable.Map[String, Int]()
  val part_name = "part_" + partIdx
  while (iter.hasNext) {
    if (part_map.contains(part_name)) {
      part_map(part_name) = part_map(part_name) + 1
    } else {
      part_map(part_name) = 1
    }
    // Advance the iterator; only the count matters here.
    iter.next()
  }
  part_map.iterator
}.collect.foreach(println)
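For reference, the same per-partition count can be written more compactly; a minimal sketch without mutable state, reusing the df above:

// Emit one (partition name, row count) pair per partition.
df.rdd
  .mapPartitionsWithIndex((partIdx, iter) => Iterator(("part_" + partIdx, iter.size)))
  .collect
  .foreach(println)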
This is the result I expect:
+-----+---------+---+
|value|nextValue|sum|
+-----+---------+---+
|    1|        1|  2|
|    7|        1|  8|
|    3|        7| 10|
|    3|        3|  6|
|    5|        3|  8|
|   21|        5| 26|
|    7|       21| 28|
|    3|        7| 10|
|    9|        3| 12|
|   10|        9| 19|
+-----+---------+---+
Answer 0 (score: 0)
I would use sliding:
import org.apache.spark.mllib.rdd.RDDFunctions._

// Pair each element with its predecessor and sum each pair.
df.as[Int].rdd.sliding(2).map(_.sum).toDF
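Note that sliding(2) over n rows produces n - 1 sums, so the first row of the expected output (the head value summed with itself) would be missing. A minimal end-to-end sketch, where the prepended head sum and names like input and headSum are my own additions rather than part of this answer:

import org.apache.spark.mllib.rdd.RDDFunctions._
import spark.implicits._ // already in scope in spark-shell

val input = sc.parallelize(Array(1, 7, 3, 3, 5, 21, 7, 3, 9, 10), 5).toDF("value")
val values = input.as[Int].rdd
// sliding(2) yields n - 1 windows, dropping the first row's self-sum,
// so prepend it by hand (head value plus itself).
val headSum = sc.parallelize(Seq(values.first * 2), 1)
val sums = headSum.union(values.sliding(2).map(_.sum)).toDF("sum")
sums.show()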
Answer 1 (score: 0)
If possible, you can try it with an extra id column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

var df = sc.parallelize(List(1, 7, 3, 3, 5, 21, 7, 3, 9, 10).zipWithIndex, 5).toDF("value", "id")
df = df.withColumn("nextValue", first(df("value")).over(Window.orderBy("id").rowsBetween(-1, 0)))
df = df.withColumn("sum", df("value") + df("nextValue"))
df.select("value", "nextValue", "sum").show()
Result:
+-----+---------+---+
|value|nextValue|sum|
+-----+---------+---+
|    1|        1|  2|
|    7|        1|  8|
|    3|        7| 10|
|    3|        3|  6|
|    5|        3|  8|
|   21|        5| 26|
|    7|       21| 28|
|    3|        7| 10|
|    9|        3| 12|
|   10|        9| 19|
+-----+---------+---+
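As a side note, the same "previous row" logic can also be written with lag instead of first over a frame. This variant is my own sketch, not part of the original answer; it reuses the df with the id column from above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// lag("value", 1) reads the previous row's value; coalesce falls back to
// the current row's value for the head row, matching the expected output.
val w = Window.orderBy("id")
val withSums = df
  .withColumn("nextValue", coalesce(lag("value", 1).over(w), col("value")))
  .withColumn("sum", col("value") + col("nextValue"))
withSums.select("value", "nextValue", "sum").show()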