How to compute a cumulative sum using Spark

Date: 2016-02-02 13:01:23

Tags: scala apache-spark

I have an RDD of (String, Int) pairs, sorted by key:

val data = Array(("c1", 6), ("c2", 3), ("c3", 4))
val rdd = sc.parallelize(data).sortByKey()

Now I want the first key's value to be zero, and each subsequent key to hold the sum of the values of all preceding keys.

For example: c1 = 0, c2 = value of c1, c3 = (value of c1 + value of c2), c4 = (values of c1 + .. + c3). Expected output:

(c1,0), (c2,6), (c3,9)...

Is it possible to achieve this? I tried it with map, but the sum was not preserved inside the map.

// Note: this cannot work in a distributed setting. Each task receives its own
// serialized copy of sum, so the running total is neither shared across
// partitions nor reflected back on the driver.
var sum = 0
val t = keycount.map { x =>
  val temp = sum
  sum = sum + x._2
  (x._1, temp)
}

5 Answers:

Answer 0 (score: 17)

  1. Compute partial results for each partition:

    val partials = rdd.mapPartitionsWithIndex((i, iter) => {
      val (keys, values) = iter.toSeq.unzip
      // scanLeft(0)(_ + _) produces n + 1 running sums, starting from 0
      val sums  = values.scanLeft(0)(_ + _)
      // zipping keys with sums.tail gives the inclusive running sum per key;
      // zip with sums.init instead to get the exclusive sums
      // ((c1,0), (c2,6), ...) shown as expected output in the question
      Iterator((keys.zip(sums.tail), sums.last))
    })
    
  2. Collect the partial sums:

    val partialSums = partials.values.collect
    
  3. Compute the cumulative sum of the partition totals and broadcast it:

    val sumMap = sc.broadcast(
      (0 until rdd.partitions.size)
        .zip(partialSums.scanLeft(0)(_ + _))
        .toMap
    )
    
  4. Compute the final result:

    val result = partials.keys.mapPartitionsWithIndex((i, iter) => {
      // offset = total of all partitions that precede this one
      val offset = sumMap.value(i)
      if (iter.isEmpty) Iterator()
      else iter.next.map { case (k, v) => (k, v + offset) }.toIterator
    })
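
A quick sanity check, as a sketch (the output shown assumes the exclusive sums.init variant noted in the comment in step 1):

    result.collect.foreach(println)
    // (c1,0)
    // (c2,6)
    // (c3,9)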
    

Answer 1 (score: 1)

Here is a solution in PySpark. Internally it is essentially the same as @zero323's Scala solution, but it provides a general-purpose function with a Spark-like API.

import numpy as np
def cumsum(rdd, get_summand):
    """Given an ordered rdd of items, computes cumulative sum of
    get_summand(row), where row is an item in the RDD.
    """
    def cumsum_in_partition(iter_rows):
        total = 0
        for row in iter_rows:
            total += get_summand(row)
            yield (total, row)
    rdd = rdd.mapPartitions(cumsum_in_partition)

    def last_partition_value(iter_rows):
        final = None
        for cumsum, row in iter_rows:
            final = cumsum
        return (final,)

    partition_sums = rdd.mapPartitions(last_partition_value).collect()
    partition_cumsums = list(np.cumsum(partition_sums))
    partition_cumsums = [0] + partition_cumsums
    partition_cumsums = sc.broadcast(partition_cumsums)

    def add_sums_of_previous_partitions(idx, iter_rows):
        return ((cumsum + partition_cumsums.value[idx], row)
            for cumsum, row in iter_rows)
    rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
    return rdd

# test for correctness by summing numbers, with and without Spark
rdd = sc.range(10000,numSlices=10).sortBy(lambda x: x)
cumsums, values = zip(*cumsum(rdd,lambda x: x).collect())
assert all(cumsums == np.cumsum(values))

Answer 2 (score: 1)

I ran into a similar problem and implemented @Paul's solution. I wanted to run cumsum on an integer frequency table sorted by key (an integer), and there was a small problem with np.cumsum(partition_sums): it failed with the error unsupported operand type(s) for +=: 'int' and 'NoneType'.

This happens because when the range of keys is large enough, the probability that every partition contains something is high (there are no None values). However, when the range is much smaller than the count and the number of partitions stays the same, some partitions end up empty and last_partition_value returns None for them. Here is the modified solution:

def cumsum(rdd, get_summand):
    """Given an ordered rdd of items, computes cumulative sum of
    get_summand(row), where row is an item in the RDD.
    """
    def cumsum_in_partition(iter_rows):
        total = 0
        for row in iter_rows:
            total += get_summand(row)
            yield (total, row)
    rdd = rdd.mapPartitions(cumsum_in_partition)
    def last_partition_value(iter_rows):
        final = None
        for cumsum, row in iter_rows:
            final = cumsum
        return (final,)
    partition_sums = rdd.mapPartitions(last_partition_value).collect()
    # partition_cumsums = list(np.cumsum(partition_sums))

    #----from here are the changed lines
    partition_sums = [x for x in partition_sums if x is not None] 
    temp = np.cumsum(partition_sums)
    partition_cumsums = list(temp)
    #----

    partition_cumsums = [0] + partition_cumsums   
    partition_cumsums = sc.broadcast(partition_cumsums)
    def add_sums_of_previous_partitions(idx, iter_rows):
        return ((cumsum + partition_cumsums.value[idx], row)
            for cumsum, row in iter_rows)
    rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
    return rdd

# test on a random integer frequency table
import pandas as pd

x = np.random.randint(10, size=1000)
D = sqlCtx.createDataFrame(pd.DataFrame(x.tolist(), columns=['D']))
c = D.groupBy('D').count().orderBy('D')
c_rdd = c.rdd.map(lambda x: x['count'])
cumsums, values = zip(*cumsum(c_rdd, lambda x: x).collect())

Answer 3 (score: 1)

Spark has built-in support for Hive ANALYTICS/WINDOWING functions, and a cumulative sum is easy to implement with an analytics function.

See the Hive wiki for the ANALYTICS/WINDOWING functions.

Example:

Assuming you have a sqlContext object:

val datardd = sqlContext.sparkContext.parallelize(
  Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4), ("d", 5), ("d", 6)))
import sqlContext.implicits._

// Register as a temporary test table
datardd.toDF("id", "val").createOrReplaceTempView("test")

// Calculate the cumulative sum
sqlContext.sql("select id, val, " +
  "SUM(val) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
  "from test").show()

This approach produces the warning below. If an executor runs out of memory, tune the job's memory parameters accordingly when working with large datasets.

WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
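
The shuffle to a single partition is inherent to a globally ordered window. If a per-bucket running sum is acceptable, a partition by clause keeps the computation distributed. A minimal sketch, assuming a hypothetical grp bucketing column and the implicits import from above:

// Sketch: grp is an assumed bucketing column, not part of the original example.
// The running sum is computed within each bucket rather than globally, but no
// single partition has to hold all the data.
val bucketed = sqlContext.sparkContext
  .parallelize(Seq(("a", 1, 0), ("b", 2, 0), ("c", 3, 1), ("d", 4, 1)))
  .toDF("id", "val", "grp")
bucketed.createOrReplaceTempView("test_bucketed")

sqlContext.sql("select id, val, " +
  "SUM(val) over ( partition by grp order by id " +
  "rows between unbounded preceding and current row ) cumulative_Sum " +
  "from test_bucketed").show()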

I hope this helps.

Answer 4 (score: -1)

You can try it with a window using rowsBetween. Hope this is still useful.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._ // for toDF and the $ column syntax (spark-shell provides this)

val data = Array(("c1", 6), ("c2", 3), ("c3", 4))
val df = sc.parallelize(data).sortByKey().toDF("c", "v")
val w = Window.orderBy("c")
// The original rowsBetween(-2, -1) only sums the two preceding rows; for a
// running sum over all preceding rows, use Window.unboundedPreceding and
// coalesce the first row's null to 0:
val r = df.select(
  $"c",
  coalesce(sum($"v").over(w.rowsBetween(Window.unboundedPreceding, -1)), lit(0)).alias("cs"))
r.show() // display(r) on Databricks notebooks
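
With the unbounded-preceding window, the example data yields cs = 0, 6, and 9 for c1, c2, and c3 respectively, which matches the expected output in the question.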